[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T0000). Please do the needful.
[00:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:19] I'll do the SWAT since I'm the only customer
[00:00:35] i see now, i need to read more wmfall
[00:01:06] "his March, we are celebrating our first WMF March Holiday " nice!
[00:01:33] well, jouncebot, you are great
[00:02:46] lol
[00:03:45] Operations, Discovery, Traffic, Maps (Tilerator): Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#1558879 (Pnorman) I'd rather see `max-age` significantly reduced and `stale-while-revalidate` set to the current `max-age` value. This avoids the need to invalidate t...
[00:04:03] when does the phab window begin?
[00:04:50] twentyafterfour:
[00:06:10] Hopefully not during SWAT? :D
[00:09:54] mutante: the 'hoilday' is the made up WMF day off in March. there is an email about it on wmfall
[00:10:31] mutante: in an hour
[00:10:39] bd808: i just found out about that... thanks to the bot :)
[00:10:58] jouncebot: refresh
[00:11:06] I refreshed my knowledge about deployments.
[00:11:10] it's nice to have a free Monday that is NOT a holiday for everybody
[00:11:17] good for hotel/flights, heh
[00:11:47] twentyafterfour: gotcha, thanks. i was thinking about the "phd to systemd" thing
[00:13:45] (PS1) Dzahn: remove fluorine from DHCP config [puppet] - https://gerrit.wikimedia.org/r/341939 (https://phabricator.wikimedia.org/T159996)
[00:15:45] Operations, Patch-For-Review: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) - https://phabricator.wikimedia.org/T123728#3086271 (Dzahn)
[00:15:47] (PS1) Dzahn: mediawiki::logging: remove fluorine from firewall rules [puppet] - https://gerrit.wikimedia.org/r/341940 (https://phabricator.wikimedia.org/T159996)
[00:17:47] (PS1) Dzahn: remove fluorine prod IP, keep mgmt [dns] - https://gerrit.wikimedia.org/r/341941 (https://phabricator.wikimedia.org/T159996)
[00:18:02] (PS2) Dzahn: remove fluorine prod IP, keep mgmt [dns] - https://gerrit.wikimedia.org/r/341941 (https://phabricator.wikimedia.org/T159996)
[00:18:32] !log catrope@tin Synchronized php-1.29.0-wmf.15/extensions/Echo/modules/styles/mw.echo.ui.NotificationBadgeWidget.less: Fix RTL popup alignment (T159999) (duration: 00m 42s)
[00:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:41] T159999: 1.29.0-wmf.15 regression: notification popup misaligned, partially off-screen in RTL - https://phabricator.wikimedia.org/T159999
[00:22:29] (CR) Dzahn: "@Krinkle" [puppet] - https://gerrit.wikimedia.org/r/341789 (https://phabricator.wikimedia.org/T123728) (owner: Filippo Giunchedi)
[00:23:08] (CR) Krinkle: [C: 1] "Thanks" [puppet] - https://gerrit.wikimedia.org/r/341940 (https://phabricator.wikimedia.org/T159996) (owner: Dzahn)
[00:23:19] (CR) Krinkle: [C: 1] site: use spare::system on fluorine [puppet] - https://gerrit.wikimedia.org/r/341789 (https://phabricator.wikimedia.org/T123728) (owner: Filippo Giunchedi)
[00:34:14] (CR) Dzahn: [C: 2] Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[00:35:01] RoanKattouw: are you still swatting?
[00:35:10] Sorry, no I'm done
[00:35:27] ok, good timing then. i'm doing a phab change, right before the phab window starts
[00:36:37] !log iridium - temp. disable puppet | phab1001 - converting service to base::service_unit (T137928)
[00:36:43] 2001,
[00:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:45] T137928: Deploy phabricator to phab2001.codfw.wmnet - https://phabricator.wikimedia.org/T137928
[00:36:57] paladox: you here?
[00:37:00] yep
[00:37:30] i was gonna say there is a problem with 2001, but there is not
[00:37:53] Oh
[00:38:29] it said: Service[phd]/enable: change from false to true failed: Could not enable phd:
[00:38:34] but:
[00:38:42] a) we disabled it there on purpose
[00:38:51] b) for some reason that does not repeat itself on each puppet run
[00:39:11] oh
[00:39:30] it's because we used "mask" to disable it
[00:39:41] i think
[00:40:09] yep
[00:40:18] unfortunately this shows as "failed" in the overview
[00:40:26] which makes it look bad in Icinga
[00:40:29] oh
[00:40:36] but that isn't a change due to this
[00:40:38] it was like it before
[00:40:47] it's just been ACKed
[00:41:08] now we get to iridium ...
[00:41:54] !log iridium - re-enable puppet, convert to base::service unit, phd restarting
[00:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:02] oh
[00:42:04] :)
[00:42:28] Notice: /Stage[main]/Phabricator/Base::Service_unit[phd]/Service[phd]/ensure: ensure changed 'stopped' to 'running'
[00:42:31] looks all good
[00:42:50] thanks for the conversion
[00:47:46] Your welcome :0
[00:47:47] :)
[00:47:48] mutante test phd
[00:47:50] since it is using upstart
[00:47:52] on iridium
[00:49:18] root@gerrit-test3:/var/lib/gerrit2/review_site# groups gerrit2
[00:49:18] gerrit2 : nda labsadminbots
[00:49:21] woops
[00:50:54] paladox:
[00:50:54] iridium:/etc/init] $ status phd
[00:50:55] phd start/running
[00:50:57] i did
[00:51:01] Yep
[00:51:04] :)
[00:51:04] did you mean anything more specific?
[00:51:07] Nope
[00:51:17] delete that old init script though, right
[00:51:46] Yep, that's a to do once we move from trusty to jessie
[00:51:53] heh, it can't be stopped
[00:52:07] Oh
[00:52:12] sudo service phd stop?
[00:53:25] paladox: ok, tested more, nevermind :)
[00:53:35] it does work and i removed the old symlink
[00:53:45] ok :)
[00:54:11] !log iridium - tested stop/start of phd service with upstart, unlink /etc/init.t/phd which was the formerly used symlink to a phab php script
[00:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:23] :)
[00:55:36] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:05:57] hmm no jouncebot?
[01:06:23] !log updating phabricator to tag release/2017-03-08/1
[01:06:26] jouncebot: ping
[01:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:56] !log phabricator update complete.
[01:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:23] jouncebot: now
[01:11:23] For the next 0 hour(s) and 48 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T0100)
[01:11:47] twentyafterfour: ^ not sure why the dumb bot forgot to yell at you
[01:11:53] jouncebot: next
[01:11:53] In 12 hour(s) and 48 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T1400)
[01:15:46] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:20:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 15 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:21:06] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:24:36] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[01:49:06] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[01:55:16] PROBLEM - puppet last run on aqs1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:04:26] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd)
[02:05:16] fixing phd
[02:05:26] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 13 processes with UID = 997 (phd)
[02:05:56] thanks. btw this check is gone from phab2001 since today
[02:06:35] (it will move with the active server)
[02:08:10] Yay :D
[02:19:41] (PS1) Dzahn: planet: get rid of $realm-case, use Hiera for domain name [puppet] - https://gerrit.wikimedia.org/r/341959
[02:23:16] RECOVERY - puppet last run on aqs1009 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[02:23:34] Operations, MediaWiki-JobRunner, Patch-For-Review, User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#3086682 (aaron) I put it up for SWAT tomorrow.
[02:24:09] (PS2) Dzahn: planet: get rid of $realm-case, use Hiera for domain name [puppet] - https://gerrit.wikimedia.org/r/341959
[02:25:01] (CR) jerkins-bot: [V: -1] planet: get rid of $realm-case, use Hiera for domain name [puppet] - https://gerrit.wikimedia.org/r/341959 (owner: Dzahn)
[02:26:07] (PS3) Dzahn: planet: get rid of $realm-case, use Hiera for domain name [puppet] - https://gerrit.wikimedia.org/r/341959
[02:30:55] (PS1) Dzahn: racktables: get rid of $realm-case, use Hiera for host name [puppet] - https://gerrit.wikimedia.org/r/341960
[02:32:39] (CR) jerkins-bot: [V: -1] racktables: get rid of $realm-case, use Hiera for host name [puppet] - https://gerrit.wikimedia.org/r/341960 (owner: Dzahn)
[02:34:57] (PS2) Dzahn: racktables: get rid of $realm-case, use Hiera for host name [puppet] - https://gerrit.wikimedia.org/r/341960
[02:36:34] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.14) (duration: 14m 34s)
[02:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:46] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.15) (duration: 14m 35s)
[03:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:15:39] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Mar 9 03:15:39 UTC 2017 (duration 5m 53s)
[03:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:06] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 791.70 seconds
[03:31:06] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 251.79 seconds
[03:32:40] AaronSchulz: i added a pull request for your class_alias problem. I don't think we have any particular solution to backporting stuff into phan though, which means waiting for a release, upgrading CI to php 7.1 (latest phan is 7.1), and using the new version
[03:33:13] (Assuming it works, i also didn't have a valid 7.1 installation to test with so fixed against 0.7 branch, then cherry picked to master and letting travis figure it out ...)
[03:33:27] doh wrong channel ... oh well this will work too
[03:42:34] fortunately it's nothing urgent
[03:49:16] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:04:12] (PS1) Dzahn: udp2log: fix lint warning [puppet] - https://gerrit.wikimedia.org/r/341964
[04:06:52] (PS1) Dzahn: prometheus: fix lint warning [puppet] - https://gerrit.wikimedia.org/r/341965
[04:07:55] (CR) jerkins-bot: [V: -1] prometheus: fix lint warning [puppet] - https://gerrit.wikimedia.org/r/341965 (owner: Dzahn)
[04:09:45] (PS1) Dzahn: authdns: fix lint warning [puppet] - https://gerrit.wikimedia.org/r/341966
[04:12:03] (CR) Dzahn: "this and 2 other patches are the only warnings there are across the whole repo, after that it's warning-free again" [puppet] - https://gerrit.wikimedia.org/r/341964 (owner: Dzahn)
[04:14:04] (PS2) Dzahn: authdns: fix lint warning [puppet] - https://gerrit.wikimedia.org/r/341966
[04:17:16] RECOVERY - puppet last run on maps1004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[04:18:14] (CR) Dzahn: "well, this one is annoying, because while we see here "WARNING indentation of => is not properly aligned (arrow_alignment)" it is actually" [puppet] - https://gerrit.wikimedia.org/r/341965 (owner: Dzahn)
[04:18:52] (PS2) Dzahn: prometheus: fix lint warning [puppet] - https://gerrit.wikimedia.org/r/341965
[04:19:23] (CR) Dzahn: [C: 1] site: use spare::system on fluorine [puppet] - https://gerrit.wikimedia.org/r/341789 (https://phabricator.wikimedia.org/T123728) (owner: Filippo Giunchedi)
[04:20:06] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=345.25 Read Requests/Sec=2228.60 Write Requests/Sec=1.10 KBytes Read/Sec=14889.60 KBytes_Written/Sec=44.40
[04:20:22] (CR) jerkins-bot: [V: -1] prometheus: fix lint warning [puppet] - https://gerrit.wikimedia.org/r/341965 (owner: Dzahn)
[04:22:05] (CR) Dzahn: "jenkins-bot said -1 for line 3 - 7: modules/mediawiki/manifests/maintenance/uploads.pp WARNING indentation of => is not properly aligned " [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[04:23:43] (CR) Dzahn: maintenance: provision /etc/wgetrc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[04:27:26] PROBLEM - salt-minion processes on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[04:27:46] PROBLEM - Check systemd state on lvs1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:28:41] (PS2) Krinkle: udp2log: fix lint warning [puppet] - https://gerrit.wikimedia.org/r/341964 (owner: Dzahn)
[04:28:54] (CR) Krinkle: "(moved warning from footer-meta to commit-msg body)" [puppet] - https://gerrit.wikimedia.org/r/341964 (owner: Dzahn)
[04:30:06] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=3.00 Read Requests/Sec=0.90 Write Requests/Sec=6.00 KBytes Read/Sec=14.80 KBytes_Written/Sec=126.80
[04:55:56] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:56] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:56:57] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:26] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[04:57:27] PROBLEM - DPKG on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:28] PROBLEM - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:57:30] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:57:36] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[04:57:39] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:57:39] PROBLEM - Check rp_filter disabled on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:39] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[04:57:39] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - rendering-https_443 - Could not depool server mw1298.eqiad.wmnet because of too many down!: rendering_80 - Could not depool server mw1298.eqiad.wmnet because of too many down!
[04:57:40] PROBLEM - wiki content on commons on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:57:40] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - rendering_80 - Could not depool server mw1297.eqiad.wmnet because of too many down!
[04:58:06] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:58:56] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[05:00:56] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[05:00:56] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[05:00:56] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[05:00:56] PROBLEM - puppet last run on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:01:06] PROBLEM - PyBal backends health check on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:01:19] RECOVERY - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 523 bytes in 0.011 second response time
[05:01:20] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16132 bytes in 0.053 second response time
[05:01:46] PROBLEM - dhclient process on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:01:46] PROBLEM - configured eth on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:01:46] PROBLEM - Disk space on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:02:06] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 23 probes of 416 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[05:03:47] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:04:16] PROBLEM - SSH on lvs1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:04:17] RECOVERY - DPKG on lvs1001 is OK: All packages OK
[05:04:28] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16134 bytes in 0.075 second response time
[05:04:28] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy
[05:04:37] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[05:05:46] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[05:07:07] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:07:07] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:07:07] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 15 probes of 416 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[05:07:36] RECOVERY - Check rp_filter disabled on lvs1001 is OK: OK: kernel parameters are set to expected value.
[05:07:39] RECOVERY - wiki content on commons on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 153114 bytes in 0.013 second response time
[05:07:39] RECOVERY - dhclient process on lvs1001 is OK: PROCS OK: 0 processes with command name dhclient
[05:07:39] RECOVERY - configured eth on lvs1001 is OK: OK - interfaces up
[05:07:39] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy
[05:07:39] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[05:07:46] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[05:08:36] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1863 bytes in 0.085 second response time
[05:09:06] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:09:06] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:09:58] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[05:09:58] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[05:09:58] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[05:09:58] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[05:10:17] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:11:06] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[05:11:06] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:11:46] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy
[05:12:06] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:12:06] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:12:06] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:13:08] PROBLEM - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:13:46] RECOVERY - Disk space on lvs1001 is OK: DISK OK
[05:13:47] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy
[05:14:06] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:06] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:06] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:06] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:06] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:06] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:06] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:07] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:07] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:08] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:08] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:09] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:58] RECOVERY - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 512 bytes in 0.007 second response time
[05:15:56] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
[05:15:56] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[05:15:56] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
[05:15:56] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
[05:15:56] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[05:15:57] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[05:15:57] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
[05:15:58] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[05:15:58] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy
[05:15:59] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[05:15:59] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[05:16:00] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[05:16:00] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[05:16:01] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[05:16:17] PROBLEM - pybal on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:16:26] PROBLEM - DPKG on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:17:46] PROBLEM - Check rp_filter disabled on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:01] hmm
[05:18:06] not sure if this is real
[05:18:29] or if it's something with icinga
[05:18:38] PROBLEM - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:18:46] PROBLEM - dhclient process on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:46] PROBLEM - configured eth on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:56] ok
[05:19:00] I can't get onto lvs1001
[05:19:06] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 7 probes of 416 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[05:19:38] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:20:36] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - rendering-https_443 - Could not depool server mw1293.eqiad.wmnet because of too many down!: rendering_80 - Could not depool server mw1296.eqiad.wmnet because of too many down!
[05:20:56] PROBLEM - Disk space on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:21:06] RECOVERY - pybal on lvs1001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal
[05:21:38] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:22:04] I'm trying to get into mgmt for it
[05:22:05] hi bblack--
[05:22:09] or bblack
[05:22:11] hi
[05:22:23] bblack: I can ping lvs1001 but can't ssh in
[05:22:27] ping was also intermittent early on
[05:22:32] did the lvs1001 issue predate all the scb, etc spam?
[05:22:36] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[05:22:38] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[05:22:48] PROBLEM - wiki content on commons on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:22:53] seems close enough to the start, anyways
[05:23:18] bblack: yeah, not sure.
[05:23:27] RECOVERY - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 524 bytes in 0.009 second response time
[05:23:29] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16133 bytes in 0.086 second response time
[05:23:33] bblack: I see salt on lvs1001 complaining before everything
[05:23:46] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[05:23:46] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - rendering-https_443 - Could not depool server mw1295.eqiad.wmnet because of too many down!: rendering_80 - Could not depool server mw1297.eqiad.wmnet because of too many down!
[05:24:18] ok
[05:24:25] I'm going to halt it from console
[05:24:28] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16134 bytes in 0.072 second response time
[05:24:43] bblack: ok!
[05:24:50] bblack: am on the console already, want me to reboot it?
[05:24:54] or shall I leave it to you?
[05:24:56] !log poweroff lvs1001 from idrac
[05:25:01] did
[05:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:12] ok!
[05:25:16] lvs1004 should automatically take over for it on death, but if lvs1001 is in some half-dead state
[05:25:16] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.09 seconds
[05:25:39] * yuvipanda nods
[05:25:46] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:25:56] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[05:26:16] PROBLEM - pybal on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:26:29] ganglia says 'down' for lvs1001
[05:26:58] lots of spiking
[05:27:17] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:27:17] yeah
[05:27:35] I don't see traffic moving to 1004, though
[05:28:18] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[05:28:26] PROBLEM - Host en.m.wikipedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[05:28:26] PROBLEM - Host en.wikibooks.org is DOWN: PING CRITICAL - Packet loss = 100%
[05:28:26] hmm
[05:28:48] PROBLEM - Host text-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[05:29:06] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:06] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:06] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:06] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:06] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:06] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:06] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:07] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:07] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:08] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:08] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:09] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:36] PROBLEM - Host en.wikipedia.org is DOWN: PING CRITICAL - Packet loss = 100% [05:29:36] PROBLEM - Host commons.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [05:29:56] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy [05:29:56] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy [05:29:57] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [05:29:57] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [05:29:57] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [05:29:57] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [05:29:57] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [05:29:58] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [05:29:58] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy [05:29:59] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [05:29:59] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [05:30:00] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [05:30:00] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [05:30:01] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [05:30:36] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1852 bytes in 0.098 second response time [05:30:37] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy [05:30:58] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 86%, RTA = 0.34 ms [05:30:59] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy 
[05:31:36] PROBLEM - Host lvs1001 is DOWN: PING CRITICAL - Packet loss = 100% [05:31:48] RECOVERY - wiki content on commons on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 153114 bytes in 0.013 second response time [05:31:48] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy [05:31:48] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy [05:31:56] RECOVERY - Host commons.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [05:32:38] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:33:27] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16132 bytes in 0.056 second response time [05:33:36] RECOVERY - Host en.m.wikipedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [05:33:36] RECOVERY - Host en.wikibooks.org is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [05:33:36] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [05:33:46] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [05:34:56] RECOVERY - Host en.wikipedia.org is UP: PING WARNING - Packet loss = 54%, RTA = 1.03 ms [05:36:27] 06Operations, 10MediaWiki-General-or-Unknown: Investigate spike in 500s during asw-c2-eqiad replacement - https://phabricator.wikimedia.org/T156475#3086770 (10aaron) I wonder why would the restart cause MASTER_GTID_WAIT() to fail in a non-timeout way, e.g. 'Failed to query MASTER_POS_WAIT()'. [05:38:48] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [05:39:16] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [05:39:38] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [05:40:59] anyone getting packet loss with all things WMF? 
I'm in NYC, but I tried on my home ISP and my mobile carrier, things are really slow or may not load at all [05:42:16] PROBLEM - Host commons.wikimedia.org is DOWN: CRITICAL - Destination Unreachable (commons.wikimedia.org) [05:42:26] okay so not just me, heh [05:42:41] yes [05:42:56] RECOVERY - Host commons.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 2.70 ms [05:43:17] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:44:08] (03PS1) 10BBlack: depool eqiad front edge traffic [dns] - 10https://gerrit.wikimedia.org/r/341971 [05:44:30] (03CR) 10BBlack: [C: 032] depool eqiad front edge traffic [dns] - 10https://gerrit.wikimedia.org/r/341971 (owner: 10BBlack) [05:45:16] PROBLEM - Host commons.wikimedia.org is DOWN: CRITICAL - Destination Unreachable (commons.wikimedia.org) [05:46:06] RECOVERY - Host commons.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [05:46:46] PROBLEM - Host en.wikibooks.org is DOWN: CRITICAL - Destination Unreachable (en.wikibooks.org) [05:46:48] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:ed1a::1) [05:47:38] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [05:50:26] PROBLEM - Host commons.wikimedia.org is DOWN: CRITICAL - Destination Unreachable (commons.wikimedia.org) [05:51:06] RECOVERY - Host commons.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [05:51:48] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:ed1a::1) [05:51:56] RECOVERY - Host en.wikibooks.org is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [05:52:38] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [05:53:16] PROBLEM - Host commons.wikimedia.org is DOWN: CRITICAL - Destination Unreachable (commons.wikimedia.org) [05:54:06] RECOVERY - Host 
commons.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [05:54:48] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [05:55:38] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [05:59:47] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:ed1a::1) [06:00:38] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [06:01:26] PROBLEM - DPKG on lvs1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:02:16] RECOVERY - DPKG on lvs1004 is OK: All packages OK [06:04:48] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [06:07:38] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [06:09:16] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:09:49] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [06:12:26] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:12:38] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [06:13:06] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [06:13:16] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [06:13:26] RECOVERY - DPKG on lvs1001 is OK: All packages OK [06:13:56] RECOVERY - salt-minion processes on lvs1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:13:57] RECOVERY - Check systemd state on lvs1001 is OK: OK - running: The system is fully operational [06:13:57] RECOVERY - dhclient process on lvs1001 is OK: PROCS OK: 0 processes with command name dhclient [06:13:57] RECOVERY - configured eth on 
lvs1001 is OK: OK - interfaces up [06:13:57] RECOVERY - Check rp_filter disabled on lvs1001 is OK: OK: kernel parameters are set to expected value. [06:13:57] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [06:13:57] RECOVERY - Disk space on lvs1001 is OK: DISK OK [06:18:56] PROBLEM - Check systemd state on lvs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:19:16] PROBLEM - salt-minion processes on lvs1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:39:51] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3086795 (10Marostegui) It has actually done some recovering as the file it is scanning now has changed since last night: ``` postgres 7189 0.0 0.... [06:41:56] PROBLEM - salt-minion processes on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:41:56] PROBLEM - Check systemd state on lvs1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:42:16] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:43:16] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:45:56] PROBLEM - Host lvs1001 is DOWN: PING CRITICAL - Packet loss = 100% [06:46:56] RECOVERY - salt-minion processes on lvs1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:46:56] RECOVERY - Check systemd state on lvs1001 is OK: OK - running: The system is fully operational [06:47:06] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [06:49:06] PROBLEM - pybal on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [06:50:07] RECOVERY - pybal on lvs1001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [06:50:56] RECOVERY - Check systemd state on lvs1004 is OK: OK - running: The system is fully operational [06:51:16] RECOVERY - salt-minion processes on lvs1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:52:36] PROBLEM - Host lvs1001 is DOWN: PING CRITICAL - Packet loss = 100% [06:53:56] PROBLEM - Check systemd state on lvs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:54:16] PROBLEM - salt-minion processes on lvs1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:55:16] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:08:50] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:ed1a::1) [07:09:40] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [07:11:16] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [07:15:49] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:ed1a::1) [07:15:50] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.009 second response time [07:16:21] <_joe_> uhm I was sure I scheduled downtime there [07:16:39] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [07:17:46] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.010 second response time [07:21:06] RECOVERY - Check systemd state on lvs1004 is OK: OK - running: The system is fully operational [07:21:16] RECOVERY - salt-minion processes on lvs1004 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [07:23:16] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:45:56] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:48:20] 06Operations, 10MediaWiki-General-or-Unknown: Investigate spike in 500s during asw-c2-eqiad replacement - https://phabricator.wikimedia.org/T156475#3086871 (10jcrespo) There are other options- loadbalancing creating new connections timing out when there is no immediate error, or external storage doing that (th... 
[07:52:08] (PS1) Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/341981 (https://phabricator.wikimedia.org/T159414) [07:53:13] (PS1) Jcrespo: mariadb: Repool db1051 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/341982 (https://phabricator.wikimedia.org/T159319) [07:53:55] jynus: you go first, I will rebase [07:54:05] (PS2) Jcrespo: mariadb: Repool db1051 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/341982 (https://phabricator.wikimedia.org/T159319) [07:54:50] I have to wait for jenkins [07:55:19] no worries :) [07:56:12] (PS3) Jcrespo: mariadb: Repool db1051 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/341982 (https://phabricator.wikimedia.org/T159319) [08:01:27] (CR) Jcrespo: [C: -2] "Krinke- I am not going to amend this change, because as I said, I do not plan to deploy it (hence the -2) this is just a template for help" [mediawiki-config] - https://gerrit.wikimedia.org/r/338996 (https://phabricator.wikimedia.org/T158580) (owner: Jcrespo) [08:05:27] (CR) Jcrespo: [C: 2] mariadb: Repool db1051 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/341982 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo) [08:07:19] (Merged) jenkins-bot: mariadb: Repool db1051 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/341982 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo) [08:07:29] (CR) jenkins-bot: mariadb: Repool db1051 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/341982 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo) [08:08:14] (PS2) Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/341981 (https://phabricator.wikimedia.org/T159414) [08:10:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 after maintenance with low weight (duration: 00m 43s)
[08:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:01] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/341981 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui) [08:11:06] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [08:12:16] (Merged) jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/341981 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui) [08:12:24] buffer pool efficiency dropped to 98% on db1051 [08:12:24] (CR) jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/341981 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui) [08:12:56] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [08:13:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1093 - T159414 (duration: 00m 49s) [08:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:49] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [08:14:06] PROBLEM - puppet last run on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:15:56] PROBLEM - puppet last run on lvs1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:16:36] PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:17:56] PROBLEM - PyBal backends health check on lvs1004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [08:18:06] PROBLEM - salt-minion processes on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:18:06] PROBLEM - dhclient process on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:18:06] PROBLEM - Disk space on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:18:06] PROBLEM - configured eth on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:18:06] PROBLEM - Check rp_filter disabled on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:18:07] PROBLEM - PyBal backends health check on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:18:07] PROBLEM - Check systemd state on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:18:16] PROBLEM - pybal on lvs1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [08:18:16] PROBLEM - SSH on lvs1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:18:16] PROBLEM - pybal on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:18:36] PROBLEM - DPKG on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:21:42] !log Deploy alter table s6 revision table on db1093 - T159414 [08:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:48] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [08:24:16] PROBLEM - Host lvs1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:25:16] RECOVERY - pybal on lvs1004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [08:26:06] RECOVERY - PyBal backends health check on lvs1004 is OK: PYBAL OK - All pools are healthy [08:27:26] RECOVERY - DPKG on lvs1001 is OK: All packages OK [08:27:36] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [08:27:56] RECOVERY - Disk space on lvs1001 is OK: DISK OK [08:27:56] RECOVERY - salt-minion processes on lvs1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:27:57] RECOVERY - configured eth on lvs1001 is OK: OK - interfaces up [08:27:57] RECOVERY - dhclient process on lvs1001 is OK: PROCS OK: 0 processes with command name dhclient [08:27:57] RECOVERY - Check systemd state on lvs1001 is OK: OK - running: The system is fully operational [08:27:57] RECOVERY - Check rp_filter disabled on lvs1001 is OK: OK: kernel parameters are set to expected value. 
[08:28:06] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [08:32:16] PROBLEM - pybal on lvs1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [08:32:56] PROBLEM - PyBal backends health check on lvs1004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [08:34:56] Operations, DBA, Labs, Labs-Infrastructure, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3086995 (Marostegui) Not sure from which time this is: ``` FATAL: the database system is starting up FATAL: terminating walreceiver process due... [08:44:26] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [08:44:36] PROBLEM - NTP on lvs1001 is CRITICAL: NTP CRITICAL: Offset unknown [08:45:18] Operations, DBA, Labs, Labs-Infrastructure, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3086996 (jcrespo) That is me killing the replication, which will not work anyway. @akosiaris can you point us to the osm load process, do you have... [08:53:16] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [09:04:12] (CR) Hashar: [C: 1] "Definitely \O/" [puppet] - https://gerrit.wikimedia.org/r/341569 (https://phabricator.wikimedia.org/T150850) (owner: Jcrespo) [09:14:36] RECOVERY - NTP on lvs1001 is OK: NTP OK: Offset -0.0002154111862 secs [09:17:32] (CR) Hashar: [C: 1] "Since that is not used :-}" [puppet] - https://gerrit.wikimedia.org/r/341593 (owner: Chad) [09:21:10] (CR) Jcrespo: [C: 2] mariadb: Decouple mariadb::misc role to a separate file [puppet] - https://gerrit.wikimedia.org/r/341825 (https://phabricator.wikimedia.org/T150850) (owner: Jcrespo) [09:21:18] (PS3) Jcrespo: mariadb: Decouple mariadb::misc role to a separate file [puppet] - https://gerrit.wikimedia.org/r/341825 (https://phabricator.wikimedia.org/T150850) [09:22:16] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [09:22:22] (CR) Hashar: [C: -1] "That would probably do it, then it feels like a hack and I would rather have puppet do the proper thing." [puppet] - https://gerrit.wikimedia.org/r/340496 (https://phabricator.wikimedia.org/T157785) (owner: Paladox) [09:25:51] (PS3) Jcrespo: mariadb: Decouple beta role to a separate file [puppet] - https://gerrit.wikimedia.org/r/341569 (https://phabricator.wikimedia.org/T150850) [09:26:34] (CR) Hashar: [C: 1] labstore: fix typo in snapshot-manager [puppet] - https://gerrit.wikimedia.org/r/341427 (owner: Hashar) [09:27:26] (CR) Hashar: "Yup apparently that only happens on the first provisioning of a fresh machine.
Ordering issue :-}" [puppet] - https://gerrit.wikimedia.org/r/341700 (owner: Hashar) [09:29:18] (CR) Jcrespo: [C: 2] mariadb: Decouple beta role to a separate file [puppet] - https://gerrit.wikimedia.org/r/341569 (https://phabricator.wikimedia.org/T150850) (owner: Jcrespo) [09:29:26] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:36:06] PROBLEM - puppet last run on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:36] PROBLEM - DPKG on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:06] PROBLEM - Check systemd state on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:06] PROBLEM - Check rp_filter disabled on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:06] PROBLEM - dhclient process on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:06] PROBLEM - configured eth on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:16] PROBLEM - SSH on lvs1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:39:56] RECOVERY - dhclient process on lvs1001 is OK: PROCS OK: 0 processes with command name dhclient [09:39:56] RECOVERY - Check rp_filter disabled on lvs1001 is OK: OK: kernel parameters are set to expected value. [09:39:56] RECOVERY - Check systemd state on lvs1001 is OK: OK - running: The system is fully operational [09:39:57] RECOVERY - configured eth on lvs1001 is OK: OK - interfaces up [09:40:06] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [09:40:27] RECOVERY - DPKG on lvs1001 is OK: All packages OK [09:41:30] (CR) Hashar: "I have updated the beta puppetmaster and ran puppet on deployment-db03 and deployment-db04.
Only thing that happened is:" [puppet] - https://gerrit.wikimedia.org/r/341569 (https://phabricator.wikimedia.org/T150850) (owner: Jcrespo) [09:43:57] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [09:44:16] RECOVERY - pybal on lvs1001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [09:44:56] RECOVERY - PyBal backends health check on lvs1001 is OK: PYBAL OK - All pools are healthy [09:48:56] RECOVERY - PyBal backends health check on lvs1004 is OK: PYBAL OK - All pools are healthy [09:49:16] RECOVERY - pybal on lvs1004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [09:57:26] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [10:03:18] (CR) Jcrespo: "> I have updated the beta puppetmaster and ran puppet on" [puppet] - https://gerrit.wikimedia.org/r/341569 (https://phabricator.wikimedia.org/T150850) (owner: Jcrespo) [10:20:22] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/341987 [10:23:10] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/341987 (owner: Marostegui) [10:23:35] (CR) Jcrespo: [C: 1] Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/341987 (owner: Marostegui) [10:25:08] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/341987 (owner: Marostegui) [10:25:08] !log service systemd-sysctl restart on lvs hosts [10:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1093 - T159414 (duration: 00m 42s) [10:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:16] T159414:
Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [10:27:05] (PS1) Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/341988 (https://phabricator.wikimedia.org/T159414) [10:27:11] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/341987 (owner: Marostegui) [10:35:02] (CR) Muehlenhoff: [C: -1] PDFRender: Delay service shut-down to work around xpra race (1 comment) [puppet] - https://gerrit.wikimedia.org/r/341833 (https://phabricator.wikimedia.org/T159922) (owner: GWicke) [10:40:26] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:44:37] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/341988 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui) [10:46:14] Operations, ops-codfw, User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3087217 (elukey) So hosts rebooted, verified that puppet ran correctly and executed apt-get dist-upgrade. Verified also ROW allocation: ``` {'mw2251.codfw.wmnet': ' SysName: asw-a-...
[10:46:20] (Merged) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/341988 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui) [10:47:21] (CR) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/341988 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui) [10:47:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1088 - T159414 (duration: 00m 41s) [10:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:32] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [10:49:01] !log Deploy alter table s6 revision table on db1088 - T159414 [10:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:56] (PS2) Giuseppe Lavagetto: profile::etcd::replication: add --strip option [puppet] - https://gerrit.wikimedia.org/r/341805 [10:51:58] (PS1) Giuseppe Lavagetto: conftool: switch prefix to /eqiad.wmnet/conftool [puppet] - https://gerrit.wikimedia.org/r/341989 (https://phabricator.wikimedia.org/T159687) [10:59:11] Operations, Labs: Remove linux kernel 3.16 from the jessie image on labs - https://phabricator.wikimedia.org/T159990#3087240 (MoritzMuehlenhoff) @Paladox : That's entirely unrelated, the Launchpad entry refers to a bug in Upstart, which jessie doesn't use at all. @yuvipanda : I don't think there's a gene... [11:06:41] Operations, netops: Audit and cleanup border-in ACL on core routers - https://phabricator.wikimedia.org/T160055#3087244 (mark) [11:07:56] (PS1) Alexandros Kosiaris: Promote labsdb1007 to osm::master.
[puppet] - https://gerrit.wikimedia.org/r/341991 (https://phabricator.wikimedia.org/T157359) [11:14:26] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [11:15:37] (CR) Jcrespo: [C: 1] Promote labsdb1007 to osm::master. [puppet] - https://gerrit.wikimedia.org/r/341991 (https://phabricator.wikimedia.org/T157359) (owner: Alexandros Kosiaris) [11:24:32] !log Stop replication db2033 - T159707 [11:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:39] T159707: Import x1 on dbstore2001 - https://phabricator.wikimedia.org/T159707 [11:28:40] (CR) Alexandros Kosiaris: [C: 2] Promote labsdb1007 to osm::master. [puppet] - https://gerrit.wikimedia.org/r/341991 (https://phabricator.wikimedia.org/T157359) (owner: Alexandros Kosiaris) [11:29:50] jynus: merging yours (5897b00) as well [11:30:00] yes, thanks [11:30:22] I thought I had already done that [11:30:41] ah, I know, because it is a beta-only change, it was tested it there [11:30:51] but it is a noop for production, so I forgot [11:33:38] (PS1) Andrew-WMDE: Don't show rdf2latex table hint with ElectronPdfService enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/341992 (https://phabricator.wikimedia.org/T157432) [11:36:39] (CR) WMDE-Fisch: [C: 1] Don't show rdf2latex table hint with ElectronPdfService enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/341992 (https://phabricator.wikimedia.org/T157432) (owner: Andrew-WMDE) [11:36:48] (CR) Addshore: [C: 1] Don't show rdf2latex table hint with ElectronPdfService enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/341992 (https://phabricator.wikimedia.org/T157432) (owner: Andrew-WMDE) [11:45:26] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [11:49:37] (PS1) Urbanecm: Add HD logos for several projects [mediawiki-config] - https://gerrit.wikimedia.org/r/341993 (https://phabricator.wikimedia.org/T150618) [11:52:34] Operations, Performance-Team, Thumbor: Thumbor resource consumption is spiky - https://phabricator.wikimedia.org/T151851#3087379 (Gilles) Open>Resolved I'm closing this, as the load spikes have considerably lowered and now look reasonable compared to the amount of cores. Memory consumption ha... [11:56:26] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:59:47] Operations, Icinga: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060#3087392 (MoritzMuehlenhoff) [12:14:26] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:23:51] !log purging old rc rows from non-production database replicas [12:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:26] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:34:39] !log rebooting multatuli to Linux 4.9 [12:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:39] Operations, Patch-For-Review: Package the next LTS kernel (4.9) - https://phabricator.wikimedia.org/T154934#3087501 (MoritzMuehlenhoff) Linux 4.9.13 is now available in jessie-wikimedia/experimental along with updated firmware-nonfree. I have extended linux-meta with a new meta package linux-meta-4.9 whi...
[12:43:51] (03PS1) 10Jcrespo: mariadb: Repool db1051 with normal weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341999 (https://phabricator.wikimedia.org/T159319) [12:45:09] (03CR) 10Jcrespo: [C: 04-1] "We may want to wait a bit for the server cache to stabilize: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=13&fullscreen&var-dc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341999 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [12:45:35] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342001 [12:45:52] (03PS2) 10Tarrow: remove elasticsearch plugin_dir setting [puppet] - 10https://gerrit.wikimedia.org/r/341831 [12:50:20] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342001 (owner: 10Marostegui) [12:52:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342001 (owner: 10Marostegui) [12:52:27] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342001 (owner: 10Marostegui) [13:05:40] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Implement DC-local cache failure limiter in Thumbor - https://phabricator.wikimedia.org/T151065#3087549 (10Gilles) [13:07:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1088 - T159414 (duration: 00m 43s) [13:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:03] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [13:09:41] 06Operations, 06Performance-Team, 10Thumbor: Thumbor original file download limit should be 4GB - https://phabricator.wikimedia.org/T151456#3087555 (10Gilles) I'm kind of ambivalent about that now. 
We could raise the limit, but that would make Thumbor potentially consume a lot more disk when things go wrong.... [13:13:04] (03PS1) 10Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342005 (https://phabricator.wikimedia.org/T159414) [13:13:19] 06Operations, 06Performance-Team, 10Thumbor: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#3087558 (10Gilles) p:05High>03Low [13:14:57] 06Operations, 06Performance-Team, 10Thumbor, 15User-Joe: Thumbor instances exit with exit code 0 even when crashing/failing - https://phabricator.wikimedia.org/T149560#3087562 (10Gilles) 05Open>03Resolved I'm going to close this as I think it's not actionable. Thumbor is running behind firejail and in... [13:16:01] (03PS1) 10Elukey: Remove Piwik backend probe from Varnish Misc backends [puppet] - 10https://gerrit.wikimedia.org/r/342007 (https://phabricator.wikimedia.org/T159136) [13:17:18] (03PS2) 10Elukey: Remove Piwik backend probe from Varnish Misc backends [puppet] - 10https://gerrit.wikimedia.org/r/342007 (https://phabricator.wikimedia.org/T159136) [13:18:29] (03PS1) 10Muehlenhoff: Harmomise group type for LDAP admin access [puppet] - 10https://gerrit.wikimedia.org/r/342008 (https://phabricator.wikimedia.org/T157131) [13:18:31] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342005 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [13:20:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342005 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [13:20:32] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342005 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [13:21:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1085 - T159414 (duration: 00m 41s) [13:21:25] !log 
Deploy alter table s6 revision table on db1085 - T159414 [13:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:28] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [13:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:50] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/5698/ looks good" [puppet] - 10https://gerrit.wikimedia.org/r/342007 (https://phabricator.wikimedia.org/T159136) (owner: 10Elukey) [13:31:01] argh, I notice right now, that I could do bot edits at test2wiki without edit token all the time, because since lunch time today UTC it's not possible any more ... [13:33:57] (03CR) 10Elukey: cache_misc: set timeout_idle to 120s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341576 (https://phabricator.wikimedia.org/T159429) (owner: 10Ema) [13:43:23] !log invalidating Tasmania zoom level 10 tiles in varnish - T159631 [13:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:32] T159631: Tasmania is covered with water at z10+ - https://phabricator.wikimedia.org/T159631 [13:44:48] jouncebot: next [13:44:48] In 0 hour(s) and 15 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T1400) [13:46:11] !log removed cn=trebuchet group from LDAP directory (Bug: T129788) [13:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:18] T129788: Review list of LDAP groups and document exactly what kind of access they can be allowed to provide - https://phabricator.wikimedia.org/T129788 [13:48:35] (03PS2) 10Jcrespo: mariadb: Repool db1051 with normal weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341999 (https://phabricator.wikimedia.org/T159319) [13:48:55] \o/ [13:49:06] (03CR) 10Jcrespo: [C: 032] "Thinking it better, db1055 has a 
worse cache hit ration, so this can be pooled now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341999 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [13:51:27] (03Merged) 10jenkins-bot: mariadb: Repool db1051 with normal weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341999 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [13:51:39] (03CR) 10jenkins-bot: mariadb: Repool db1051 with normal weight after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341999 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [13:52:46] !log removed cn=svnadm group from LDAP directory (Bug: T129788) [13:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:52] T129788: Review list of LDAP groups and document exactly what kind of access they can be allowed to provide - https://phabricator.wikimedia.org/T129788 [13:53:19] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 with normal weight after warmup (duration: 00m 40s) [13:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:38] jouncebot: next [13:54:38] In 0 hour(s) and 5 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T1400) [13:55:53] (03PS2) 10Filippo Giunchedi: site: use spare::system on fluorine [puppet] - 10https://gerrit.wikimedia.org/r/341789 (https://phabricator.wikimedia.org/T123728) [13:58:32] (03PS1) 10Filippo Giunchedi: role: add ipvs prometheus metrics for lvs nodes [puppet] - 10https://gerrit.wikimedia.org/r/342010 [13:58:56] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] site: use spare::system on fluorine [puppet] - 10https://gerrit.wikimedia.org/r/341789 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to 
deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T1400). Please do the needful. [14:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:14] o/ [14:00:20] o/ [14:00:43] addshore: want to do swat? I can do it if there are no takers :) [14:00:56] zeljkof: I'll let you :) although I may have a patch to add! [14:01:18] addshore: if you have a patch... you do the swat! (that should be the rule) ;) [14:01:28] * addshore hides his patch until later [14:01:37] in that case... [14:01:43] I can SWAT today! [14:02:14] Urbanecm: around for swat? [14:02:26] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:03:11] (03PS3) 10Zfilipin: [throttle] Add new throttle rule+remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341812 (https://phabricator.wikimedia.org/T159957) (owner: 10Urbanecm) [14:05:51] addshore, hashar: looks like Urbanecm is not around, should I wait with his patches until he is around? [14:06:21] the patches are pretty simple, throttle and logos... [14:06:30] you should [14:06:36] for 341993 anyway [14:07:26] srdjan_m: I should wait? or should deploy? [14:07:36] you should wait [14:08:11] (03PS2) 10Filippo Giunchedi: role: add ipvs prometheus metrics for lvs at ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/342010 [14:08:19] I have added mine to the calendar now! [14:08:22] in that case, since there are no other patches... we are done with swat until Urbanecm is back [14:08:32] addshore: want to deploy it yourself? [14:08:35] Will do!
[14:08:42] (03PS3) 10Zppix: role: add ipvs prometheus metrics for lvs at ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/342010 (owner: 10Filippo Giunchedi) [14:09:33] addshore: great [14:09:57] (03CR) 10Addshore: [C: 032] Don't show rdf2latex table hint with ElectronPdfService enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341992 (https://phabricator.wikimedia.org/T157432) (owner: 10Andrew-WMDE) [14:10:01] {{doing}} [14:11:55] (03Merged) 10jenkins-bot: Don't show rdf2latex table hint with ElectronPdfService enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341992 (https://phabricator.wikimedia.org/T157432) (owner: 10Andrew-WMDE) [14:12:26] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:12:38] (03PS1) 10Jcrespo: mariadb: Decouple misc role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/342014 (https://phabricator.wikimedia.org/T150850) [14:13:26] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Decouple misc role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/342014 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [14:14:26] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:15:00] syncing [14:15:40] !log addshore@tin Synchronized wmf-config/CommonSettings.php: [[gerrit:341992|Don't show rdf2latex table hint with ElectronPdfService enabled]] T157432 (duration: 00m 49s) [14:15:42] zeljkof, I'm here [14:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:46] T157432: Change text "Rendering finished" > suggest use of "single column" for PDFs with tables - https://phabricator.wikimedia.org/T157432 [14:15:55] Urbanecm: ok [14:16:09] addshore: let me know when you are done, I will deploy Urbanecm's patches [14:16:24] zeljkof: all done here! [14:16:34] its all yours! 
[14:16:41] ok, taking over [14:17:24] (03CR) 10Filippo Giunchedi: [C: 032] mediawiki::logging: remove fluorine from firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/341940 (https://phabricator.wikimedia.org/T159996) (owner: 10Dzahn) [14:17:59] (03CR) 10jenkins-bot: Don't show rdf2latex table hint with ElectronPdfService enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341992 (https://phabricator.wikimedia.org/T157432) (owner: 10Andrew-WMDE) [14:18:11] (03PS4) 10Zfilipin: [throttle] Add new throttle rule+remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341812 (https://phabricator.wikimedia.org/T159957) (owner: 10Urbanecm) [14:18:29] Urbanecm: rebasing 341812, will +2 and deploy [14:18:50] (03PS2) 10Jcrespo: mariadb: Decouple misc role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/342014 (https://phabricator.wikimedia.org/T150850) [14:19:16] zeljkof, ack [14:19:26] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:20:41] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/5702/" [puppet] - 10https://gerrit.wikimedia.org/r/342010 (owner: 10Filippo Giunchedi) [14:20:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341812 (https://phabricator.wikimedia.org/T159957) (owner: 10Urbanecm) [14:20:55] (03CR) 10Jcrespo: [C: 032] mariadb: Decouple misc role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/342014 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [14:20:59] (03PS1) 10Filippo Giunchedi: hieradata: remove access to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/342017 (https://phabricator.wikimedia.org/T123728) [14:22:34] (03Merged) 10jenkins-bot: [throttle] Add new throttle rule+remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341812 (https://phabricator.wikimedia.org/T159957) (owner: 10Urbanecm) [14:22:36] (03PS3) 10Jcrespo: mariadb: Decouple misc role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/342014 (https://phabricator.wikimedia.org/T150850) [14:22:59] (03PS2) 10Zfilipin: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341993 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [14:23:46] (03CR) 10jenkins-bot: [throttle] Add new throttle rule+remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341812 (https://phabricator.wikimedia.org/T159957) (owner: 10Urbanecm) [14:23:48] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: remove access to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/342017 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [14:25:03] (03PS4) 10Jcrespo: mariadb: Decouple misc role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/342014 (https://phabricator.wikimedia.org/T150850) [14:25:24] !log zfilipin@tin 
Synchronized wmf-config/throttle.php: SWAT: [[gerrit:341812|throttle] Add new throttle rule+remove expired rules (T159957)]] (duration: 00m 45s) [14:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:31] T159957: Remove IP cap for account creation for MoMA NYC - Saturday March 11 - https://phabricator.wikimedia.org/T159957 [14:25:34] Urbanecm: 341812 deployed [14:25:49] working on 341993 [14:25:51] zeljkof, ack [14:26:12] srwiki-1.5x.png and srwiki-2x.png don't match srwiki.png, just fyi [14:26:38] (03PS1) 10Alexandros Kosiaris: Temporarily set labsdb1007 hiera data [puppet] - 10https://gerrit.wikimedia.org/r/342018 [14:26:48] srdjan_m, must overseen it, thank you for notification [14:27:04] Urbanecm: will you amend the patch? [14:27:15] zeljkof, I'll just delete them and solve them later. [14:27:26] so, I can proceed with the deploy? [14:27:29] as is? [14:27:39] zeljkof, no, please wait. [14:27:46] ok, waiting [14:28:24] (03PS3) 10Urbanecm: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341993 (https://phabricator.wikimedia.org/T150618) [14:28:46] zeljkof, please deploy PS3 [14:30:23] ok, reviewing [14:31:24] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [14:32:52] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341993 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [14:34:08] (03Merged) 10jenkins-bot: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341993 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [14:34:24] (03CR) 10jenkins-bot: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341993 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [14:35:44] !log removed cn=svn group from LDAP directory (Bug: T129788) [14:35:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:50] T129788: Review list of LDAP groups and document exactly what kind of access they can be allowed to provide - https://phabricator.wikimedia.org/T129788 [14:36:33] (03CR) 10Alexandros Kosiaris: [C: 032] Temporarily set labsdb1007 hiera data [puppet] - 10https://gerrit.wikimedia.org/r/342018 (owner: 10Alexandros Kosiaris) [14:36:38] (03PS2) 10Alexandros Kosiaris: Temporarily set labsdb1007 hiera data [puppet] - 10https://gerrit.wikimedia.org/r/342018 [14:36:40] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Temporarily set labsdb1007 hiera data [puppet] - 10https://gerrit.wikimedia.org/r/342018 (owner: 10Alexandros Kosiaris) [14:36:56] I was about to do the same [14:38:38] I think the reason those were there was labsdb1006/7 themselves :-) [14:38:52] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:341993|Add HD logos for several projects (T150618)]] (duration: 00m 42s) [14:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:58] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [14:39:44] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:341993|Add HD logos for several projects (T150618)]] (duration: 00m 41s) [14:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:05] Urbanecm: 341993 deployed, please check the logos at wikis [14:40:10] zeljkof, checking [14:41:35] zeljkof, working [14:41:58] Urbanecm: all good? [14:42:01] (03CR) 10Ema: [C: 031] role: add ipvs prometheus metrics for lvs at ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/342010 (owner: 10Filippo Giunchedi) [14:42:02] Yep [14:42:08] great, in that case... 
[14:42:14] !log EU SWAT finished [14:42:14] (03PS1) 10Alexandros Kosiaris: role::osm::common: Conditionalize tuning.conf inclusion [puppet] - 10https://gerrit.wikimedia.org/r/342019 [14:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:49] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342020 [14:42:52] (03PS4) 10Filippo Giunchedi: role: add ipvs prometheus metrics for lvs at ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/342010 [14:45:13] (03CR) 10Alexandros Kosiaris: [C: 032] role::osm::common: Conditionalize tuning.conf inclusion [puppet] - 10https://gerrit.wikimedia.org/r/342019 (owner: 10Alexandros Kosiaris) [14:45:17] (03PS2) 10Alexandros Kosiaris: role::osm::common: Conditionalize tuning.conf inclusion [puppet] - 10https://gerrit.wikimedia.org/r/342019 [14:45:20] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] role::osm::common: Conditionalize tuning.conf inclusion [puppet] - 10https://gerrit.wikimedia.org/r/342019 (owner: 10Alexandros Kosiaris) [14:45:40] (03CR) 10Filippo Giunchedi: [C: 032] role: add ipvs prometheus metrics for lvs at ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/342010 (owner: 10Filippo Giunchedi) [14:45:48] (03PS5) 10Filippo Giunchedi: role: add ipvs prometheus metrics for lvs at ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/342010 [14:46:07] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] role: add ipvs prometheus metrics for lvs at ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/342010 (owner: 10Filippo Giunchedi) [14:49:24] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:55:05] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3087865 (10Ottomata) I think we are ready to proceed with this when yall are. 
Should we schedule a day next wee... [14:55:21] (03PS1) 10Alexandros Kosiaris: osm::planet_import: conditionalize load of 900913.sql [puppet] - 10https://gerrit.wikimedia.org/r/342022 [14:55:26] jouncebot: now [14:55:26] For the next 0 hour(s) and 4 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T1400) [14:57:41] (03CR) 10Filippo Giunchedi: "Parsing /proc/net/ip_vs_stats works:" [puppet] - 10https://gerrit.wikimedia.org/r/342010 (owner: 10Filippo Giunchedi) [14:57:43] (03CR) 10Gehel: [C: 04-1] remove elasticsearch plugin_dir setting (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [14:57:47] (03PS3) 10Gehel: remove elasticsearch plugin_dir setting [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [14:57:58] (03CR) 10Alexandros Kosiaris: [C: 032] osm::planet_import: conditionalize load of 900913.sql [puppet] - 10https://gerrit.wikimedia.org/r/342022 (owner: 10Alexandros Kosiaris) [14:58:03] (03PS2) 10Alexandros Kosiaris: osm::planet_import: conditionalize load of 900913.sql [puppet] - 10https://gerrit.wikimedia.org/r/342022 [14:58:07] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] osm::planet_import: conditionalize load of 900913.sql [puppet] - 10https://gerrit.wikimedia.org/r/342022 (owner: 10Alexandros Kosiaris) [14:59:06] (03CR) 10jerkins-bot: [V: 04-1] remove elasticsearch plugin_dir setting [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [15:01:21] (03CR) 10Gehel: [C: 04-1] remove elasticsearch plugin_dir setting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [15:02:34] !log installing nettle security updates [15:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:26] gehel: I'm not quite sure I explained clearly what is needed in https://gerrit.wikimedia.org/r/341831. 
Having $plugins_dir set in the base ::elasticsearch means that it fails unless you have made the link like in "common" and "logstash" [15:07:36] tarrow: do you have the exact failure message? [15:07:39] !log reedy@tin Synchronized php-1.29.0-wmf.15/extensions/ConfirmEdit: Fixup maintenance script (duration: 00m 43s) [15:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:59] (03CR) 10Jcrespo: [C: 032] mariadb: Decouple misc role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/342014 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [15:08:04] (03PS5) 10Jcrespo: mariadb: Decouple misc role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/342014 (https://phabricator.wikimedia.org/T150850) [15:08:36] (03PS3) 10Elukey: Remove Piwik backend probe from Varnish Misc backends [puppet] - 10https://gerrit.wikimedia.org/r/342007 (https://phabricator.wikimedia.org/T154558) [15:08:38] well; in the ES log you get "java.lang.IllegalStateException: Unable to access 'path.plugins' (/srv/deployment/elasticsearch/plugins)" because we've never made it [15:09:24] Ok, so in your case, you probably want to set it to /usr/share/elasticsearch/plugins [15:09:28] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3070663 (10elukey) Just completed https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Worker_Nodes_... [15:10:08] gehel: yep, that also works. I assumed the right thing to do was to set it to /usr/share/elasticsearch/plugins in ::elasticsearch (by removing it) [15:10:17] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342020 [15:10:33] tarrow: yeah, that would be nicer, but it will break our production cluster...
[15:10:48] and then set it to "/srv/deployment/elasticsearch/plugins" in the production roles [15:11:14] there are some assumptions in that module that only hold true for our specific setup [15:11:38] which is what I thought my second patch did; but I obviously couldn't test [15:12:06] Oh, I see what you mean. That would work as well, but then you need to keep the param, not remove it. And change the default value. I'll send a patch [15:13:46] ah, sure. I didn't realise removing it did anything other than set it to the default value. It seemed to work that you could still override it fine [15:13:49] thanks! [15:14:27] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342020 (owner: 10Marostegui) [15:14:41] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3087960 (10mobrovac) Wednesday 15th? 
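Editorial sketch of the plugins directory discussion above: the idea of keeping the parameter, changing its default, and overriding it in the production roles might look roughly like this. It is a Python analogy, not the Puppet module itself; the function name `elasticsearch_config` and the returned dict are hypothetical, and only the two directory paths are taken from the conversation.

```python
# Hypothetical Python analogy of a class parameter that keeps a default value.
# Only the two directory paths below come from the conversation; every other
# name here is an assumption made for illustration.
DEFAULT_PLUGINS_DIR = "/usr/share/elasticsearch/plugins"
PRODUCTION_PLUGINS_DIR = "/srv/deployment/elasticsearch/plugins"


def elasticsearch_config(plugins_dir=DEFAULT_PLUGINS_DIR):
    """Return the setting a base class would render; callers may override it."""
    return {"path.plugins": plugins_dir}


# A generic caller relies on the packaged default...
generic = elasticsearch_config()
# ...while a production role passes its own value explicitly.
production = elasticsearch_config(plugins_dir=PRODUCTION_PLUGINS_DIR)
```

Keeping the parameter (rather than deleting it) is what lets the production caller continue to pass its own path while everyone else gets the default.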
[15:16:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342020 (owner: 10Marostegui) [15:17:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1085 - T159414 (duration: 00m 41s) [15:17:11] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342020 (owner: 10Marostegui) [15:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:14] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [15:17:37] (03PS4) 10Gehel: elasticsearch - use default plugins directory in the elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [15:18:42] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch - use default plugins directory in the elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [15:20:06] (03PS5) 10Gehel: elasticsearch - use default plugins directory in the elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [15:20:52] (03CR) 10Gehel: [C: 031] "This is a noop on our current cluster (as it should): https://puppet-compiler.wmflabs.org/5705/" [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [15:25:41] (03PS6) 10Gehel: elasticsearch - use default plugins directory in the elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [15:27:01] (03CR) 10Gehel: [C: 032] elasticsearch - use default plugins directory in the elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [15:27:34] tarrow: ^ you should be good (let me know otherwise) [15:27:34] gehel: thanks! that's awesome! 
:) [15:27:40] I'll just have a test [15:35:30] (03PS1) 10Ema: lvs: load ip_vs before systemd-sysctl.service starts [puppet] - 10https://gerrit.wikimedia.org/r/342026 [15:36:53] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM but small error" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342026 (owner: 10Ema) [15:37:06] <_joe_> ema: good idea to preload the module [15:37:23] <_joe_> but you stumbled upon one of puppet's delicacies I think [15:37:54] ha! [15:37:58] thanks _joe_ [15:38:02] (03CR) 10Mobrovac: [C: 031] PDFRender: Delay service shut-down to work around xpra race (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341833 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [15:38:16] <_joe_> ema: puppet apply -e "notice('ciao\n')" vs puppet apply -e 'notice("ciao\n")' [15:38:36] and of course no linter will complain that there's no variable in that double quoted string [15:39:40] <_joe_> lol tell me that's really happening :P [15:39:48] (03PS2) 10Ema: lvs: load ip_vs before systemd-sysctl.service starts [puppet] - 10https://gerrit.wikimedia.org/r/342026 [15:39:54] _joe_: let's see! [15:40:34] _joe_: nope, puppet-lint is happy [15:40:44] <_joe_> it's not /that/ stupid [15:41:25] low expectations, key to happiness [15:41:34] PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:42:37] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/342026 (owner: 10Ema) [15:45:04] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [15:45:34] RECOVERY - puppet last run on elastic1037 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:45:40] (03PS2) 10Muehlenhoff: Remove Aaron from deployment group [puppet] - 10https://gerrit.wikimedia.org/r/340101 [15:48:38] (03PS3) 10Ema: lvs: load ip_vs before systemd-sysctl.service starts [puppet] - 10https://gerrit.wikimedia.org/r/342026 [15:48:48] (03CR) 10Ema: [V: 032 C: 032] lvs: load ip_vs before systemd-sysctl.service starts [puppet] - 10https://gerrit.wikimedia.org/r/342026 (owner: 10Ema) [15:50:08] 06Operations, 10ops-codfw: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3088068 (10Papaul) [15:50:14] (03PS3) 10Muehlenhoff: Remove Aaron from deployment group [puppet] - 10https://gerrit.wikimedia.org/r/340101 [15:53:28] (03CR) 10Muehlenhoff: [C: 032] Remove Aaron from deployment group [puppet] - 10https://gerrit.wikimedia.org/r/340101 (owner: 10Muehlenhoff) [15:57:16] 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3088144 (10Ottomata) [15:57:20] 06Operations: Rethink/clarify/document use of 'analytics' vs. 'statistics' in group names - https://phabricator.wikimedia.org/T149225#3088142 (10Ottomata) 05Open>03declined I've added a sentence or two to help explain the difference here: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Group... 
[15:59:30] 06Operations, 10ops-codfw, 10hardware-requests: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3088150 (10Papaul) Disk wipe in progress [15:59:46] (03CR) 10Ema: [C: 031] Remove Piwik backend probe from Varnish Misc backends [puppet] - 10https://gerrit.wikimedia.org/r/342007 (https://phabricator.wikimedia.org/T154558) (owner: 10Elukey) [16:00:09] 06Operations, 10ops-codfw: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3088151 (10Papaul) p:05Triage>03Normal [16:02:37] (03PS4) 10Elukey: Remove Piwik backend probe from Varnish Misc backends [puppet] - 10https://gerrit.wikimedia.org/r/342007 (https://phabricator.wikimedia.org/T154558) [16:05:28] 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3088160 (10Ottomata) [16:05:31] 06Operations: Reconsider/check naming of 'privatedata' shell groups compared to their theoretically non-sensitive counterparts - https://phabricator.wikimedia.org/T149222#3088158 (10Ottomata) 05Open>03declined The '*private*' user groups here grant access to stat1002. Historically, stat1002 was used to host... [16:07:45] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3088162 (10Ottomata) Ya. @elukey should we do varnishes before or after this? I can add librdkafka 0.9.4 to ou... 
[16:09:10] (03CR) 10Elukey: [C: 032] Remove Piwik backend probe from Varnish Misc backends [puppet] - 10https://gerrit.wikimedia.org/r/342007 (https://phabricator.wikimedia.org/T154558) (owner: 10Elukey) [16:10:25] !log remove Piwik/bohrium health check from Varnish cache misc (https://gerrit.wikimedia.org/r/#/c/342007/) [16:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:38] (03PS1) 10DCausse: [es5 upgrade] step 1: depool codfw for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342031 (https://phabricator.wikimedia.org/T157479) [16:15:39] (03PS1) 10DCausse: [es5 upgrade] step 2: repool codfw and send wmf16 to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342032 (https://phabricator.wikimedia.org/T157479) [16:15:42] (03PS1) 10DCausse: [es5 upgrade] step 3: depool eqiad for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342033 (https://phabricator.wikimedia.org/T157479) [16:15:44] (03PS1) 10DCausse: [es5 upgrade] step 4: repool eqiad and restore normal operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342034 (https://phabricator.wikimedia.org/T157479) [16:22:39] (03PS4) 10Gehel: deployment-prep: Use elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/341372 (owner: 10EBernhardson) [16:27:07] (03CR) 10Gehel: [C: 032] deployment-prep: Use elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/341372 (owner: 10EBernhardson) [16:27:44] (03PS1) 10Papaul: DNS: Add mgmt dns for oresrdb2002 [dns] - 10https://gerrit.wikimedia.org/r/342036 [16:32:54] PROBLEM - MariaDB Slave SQL: x1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table heartbeat.heartbeat: Cant find record in heartbeat, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1031-bin.002061, end_log_pos 445907880 [16:33:55] (03CR) 10Madhuvishy: [C: 032] Remove non-existing group from 
jupyterhub LDAP config [puppet] - 10https://gerrit.wikimedia.org/r/341336 (https://phabricator.wikimedia.org/T129788) (owner: 10Muehlenhoff) [16:34:02] (03PS2) 10Madhuvishy: Remove non-existing group from jupyterhub LDAP config [puppet] - 10https://gerrit.wikimedia.org/r/341336 (https://phabricator.wikimedia.org/T129788) (owner: 10Muehlenhoff) [16:34:51] (03CR) 10Madhuvishy: [V: 032 C: 032] Remove non-existing group from jupyterhub LDAP config [puppet] - 10https://gerrit.wikimedia.org/r/341336 (https://phabricator.wikimedia.org/T129788) (owner: 10Muehlenhoff) [16:36:55] ^manuel and me are on the alert [16:37:45] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3088249 (10Papaul) [16:44:35] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3088255 (10fgiunchedi) Row C works (oresrdb2001 is row B) partman scheme: `raid1-lvm-ext4-srv.cfg` (same as oresrdb1001) [16:44:41] papaul: ^ [16:51:16] (03PS1) 10EBernhardson: [cirrus] Config update for elasticsearch 5.x in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342040 [16:54:17] (03PS2) 10RobH: DNS: Add mgmt dns for oresrdb2002 [dns] - 10https://gerrit.wikimedia.org/r/342036 (owner: 10Papaul) [16:58:50] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1001.eqiad.wmnet [16:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:08] (03PS1) 10MarcoAurelio: Allow 'autoreviewrestore' to be managed from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342042 [17:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T1700). 
[17:00:55] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3088284 (10Papaul) [17:01:56] no puppet swat patches https://i.imgur.com/m5lwP.gif [17:02:08] !log reboot lvs1004 (post-incident cleanup reboot) [17:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:53] 06Operations, 10ops-codfw, 10netops: codfw: oresrdb2002 switch port configuration - https://phabricator.wikimedia.org/T160087#3088288 (10Papaul) [17:03:21] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3088304 (10Papaul) [17:04:16] godog: thanks [17:04:55] (03CR) 10DCausse: [C: 031] [cirrus] Config update for elasticsearch 5.x in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342040 (owner: 10EBernhardson) [17:05:44] PROBLEM - Host lvs1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:07:04] RECOVERY - Host lvs1004 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [17:08:54] RECOVERY - MariaDB Slave SQL: x1 on dbstore2001 is OK: OK slave_sql_state not a slave [17:09:12] 06Operations, 13Patch-For-Review: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) - https://phabricator.wikimedia.org/T123728#3088327 (10fgiunchedi) [17:10:04] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:10:16] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#3088335 (10fgiunchedi) [17:10:18] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3088336 (10fgiunchedi) [17:10:21] 06Operations, 13Patch-For-Review: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) - 
https://phabricator.wikimedia.org/T123728#2243032 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This is completed, I've left out the part about sending logs to both datacenters as out of sco... [17:11:02] !log reboot lvs1001 (post-incident cleanup reboot) [17:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:12] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1002.eqiad.wmnet [17:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:21] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1003.eqiad.wmnet [17:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:14] PROBLEM - Host lvs1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:14:44] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [17:15:50] (03PS1) 10Gehel: elasticsearch - make size of bulk executor configureable [puppet] - 10https://gerrit.wikimedia.org/r/342043 [17:16:49] (03CR) 10EBernhardson: [C: 031] elasticsearch - make size of bulk executor configureable [puppet] - 10https://gerrit.wikimedia.org/r/342043 (owner: 10Gehel) [17:19:33] (03PS2) 10DCausse: Elastic 5.1.2 plugins [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/341826 [17:21:29] (03PS2) 10Gehel: elasticsearch - make size of bulk executor configureable [puppet] - 10https://gerrit.wikimedia.org/r/342043 [17:24:11] (03CR) 10Gehel: [V: 032 C: 032] elasticsearch - make size of bulk executor configureable [puppet] - 10https://gerrit.wikimedia.org/r/342043 (owner: 10Gehel) [17:28:30] (03PS1) 10Jcrespo: mariadb: Decouple tendril mariadb role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342050 (https://phabricator.wikimedia.org/T150850) [17:29:34] (03PS2) 10Jcrespo: mariadb: Decouple tendril mariadb role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342050 
(https://phabricator.wikimedia.org/T150850) [17:31:17] (03CR) 10Hashar: [C: 04-1] "We had to enable StatCache or the performance just crawled down. Ori made it very early you can see for the details T75706" [puppet] - 10https://gerrit.wikimedia.org/r/341916 (https://phabricator.wikimedia.org/T158176) (owner: 10Reedy) [17:31:43] 06Operations, 07HHVM, 13Patch-For-Review: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3028990 (10hashar) Unless things have changed, we had to enable StatCache or the performance just crawled down. Ori made it very early you can see for the details on T75706 [17:31:53] (03PS1) 10Gehel: elasticsearch - statsd plugin isn't used anymore [puppet] - 10https://gerrit.wikimedia.org/r/342052 [17:34:03] 06Operations, 07HHVM, 13Patch-For-Review: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3088422 (10Reedy) >>! In T158176#3088412, @hashar wrote: > Unless things have changed, we had to enable StatCache or the performance just crawled down. Ori made it very early you can see for th... [17:34:26] (03PS3) 10Jcrespo: mariadb: Decouple tendril mariadb role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342050 (https://phabricator.wikimedia.org/T150850) [17:34:28] (03PS1) 10Jcrespo: mariadb: Decouple core (mediawiki) role on a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342054 (https://phabricator.wikimedia.org/T150850) [17:36:43] (03PS2) 10Gehel: elasticsearch - statsd plugin isn't used anymore [puppet] - 10https://gerrit.wikimedia.org/r/342052 [17:39:39] 06Operations, 07HHVM, 13Patch-For-Review: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3088436 (10hashar) The assertion failure definitely happens on the beta cluster. I haven't found a good way to reproduce. Looking at the wfDebugLog I managed to find some URL that probably trig... 
[17:40:22] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3088437 (10Papaul) @Robh yes please do. Thanks [17:40:25] (03PS18) 10Nuria: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [17:41:18] (03CR) 10Jcrespo: [C: 032] mariadb: Decouple tendril mariadb role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342050 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [17:41:49] (03PS3) 10Gehel: elasticsearch - statsd plugin isn't used anymore [puppet] - 10https://gerrit.wikimedia.org/r/342052 [17:43:11] 06Operations, 06Labs: Remove linux kernel 3.16 from the jessie image on labs - https://phabricator.wikimedia.org/T159990#3088459 (10Andrew) I just built four different jessie instances, ran 'apt get update && apt-get upgrade' on them and rebooted. All four came up, no problems. [17:43:42] (03CR) 10Chad: "That one is actually being used" [puppet] - 10https://gerrit.wikimedia.org/r/341593 (owner: 10Chad) [17:43:51] 06Operations, 07HHVM, 13Patch-For-Review: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3088461 (10hashar) The beta log spam stopped because StatCache has been disabled via cherry pick of https://gerrit.wikimedia.org/r/#/c/341916/ . That hardly made any change to the instances CPU... [17:45:32] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1004.eqiad.wmnet [17:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:51] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3088468 (10hashar) [17:47:30] 06Operations, 06Labs: Remove linux kernel 3.16 from the jessie image on labs - https://phabricator.wikimedia.org/T159990#3088472 (10Paladox) Oh, i wonder why mine failed. 
[17:50:01] (03PS2) 10Jcrespo: mariadb: Decouple core (mediawiki) role on a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342054 (https://phabricator.wikimedia.org/T150850) [17:50:03] (03PS1) 10Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) [17:50:06] (03PS1) 10BryanDavis: Remove support for Precise [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/342061 (https://phabricator.wikimedia.org/T94792) [17:50:37] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1003.eqiad.wmnet [17:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:52] (03CR) 10RobH: [C: 032] DNS: Add mgmt dns for oresrdb2002 [dns] - 10https://gerrit.wikimedia.org/r/342036 (owner: 10Papaul) [17:51:38] (03PS2) 10BryanDavis: Remove support for Precise [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/342061 (https://phabricator.wikimedia.org/T94792) [17:51:44] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:42] (03PS3) 10BryanDavis: Remove support for Precise [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/342061 (https://phabricator.wikimedia.org/T94792) [17:59:44] (03PS1) 10Muehlenhoff: Enable experimental on cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/342063 [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T1800). 
[18:00:15] Nothing for ORES today [18:00:17] no parsoid deploy today [18:00:53] (03PS1) 10Jcrespo: mariadb: Decouple mariadb wikitech role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342064 (https://phabricator.wikimedia.org/T150850) [18:05:59] (03CR) 10Muehlenhoff: [V: 032 C: 032] Enable experimental on cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/342063 (owner: 10Muehlenhoff) [18:06:30] (03PS4) 10BryanDavis: Remove support for Precise [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/342061 (https://phabricator.wikimedia.org/T94792) [18:12:16] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3088551 (10RobH) [18:12:19] 06Operations, 10ops-codfw, 10netops: codfw: oresrdb2002 switch port configuration - https://phabricator.wikimedia.org/T160087#3088549 (10RobH) 05Open>03Resolved switch port updated and committed, resolving task ``` robh@asw-c-codfw# show | compare [edit interfaces interface-range vlan-private1-c-codfw... 
[18:19:44] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:21:12] !log rebooting cp1008 for upgrade to Linux 4.9 [18:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:39] 06Operations, 06DC-Ops: audit spare disk levels for codfw & eqiad utlized storage in servers - https://phabricator.wikimedia.org/T160097#3088618 (10RobH) [18:22:42] 06Operations, 06DC-Ops: audit spare disk levels for codfw & eqiad against shelf spares - https://phabricator.wikimedia.org/T160097#3088631 (10RobH) [18:22:43] 06Operations, 06DC-Ops: audit spare disk levels for codfw & eqiad utlized storage in servers - https://phabricator.wikimedia.org/T160097#3088618 (10RobH) [18:23:02] too many open tabs im reverting my own task edits =P [18:31:23] (03CR) 10Filippo Giunchedi: "LGTM on the idea, comments on the implementation" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341791 (https://phabricator.wikimedia.org/T159352) (owner: 10Gilles) [18:34:34] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:36:05] 06Operations, 10media-storage: Sanity check global-multiwrite logs for ConfirmEdit usage - https://phabricator.wikimedia.org/T159830#3088694 (10fgiunchedi) I took a quick look at the swift logs on lithium, all DELETEs seem to be successful (i.e. HTTP 200s) with the exception of some for which swift replied 404... [18:44:28] jouncebot: next [18:44:28] In 0 hour(s) and 15 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T1900) [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T1900). 
[19:01:02] (03PS1) 10BryanDavis: Full PEP8/Flake8 compliance [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/342069 [19:02:14] (03Abandoned) 10DCausse: [cirrus] Add $wgCirrusSearchElasticQuirks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339409 (owner: 10DCausse) [19:02:34] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:02:44] (03PS2) 10Dzahn: mediawiki::logging: remove fluorine from firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/341940 (https://phabricator.wikimedia.org/T159996) [19:04:25] (03CR) 10Dzahn: [C: 032] "fluorine is a "role spare" now and about to be decom'ed." [puppet] - 10https://gerrit.wikimedia.org/r/341940 (https://phabricator.wikimedia.org/T159996) (owner: 10Dzahn) [19:06:40] (03CR) 10Yuvipanda: [C: 031] "This should work." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/342061 (https://phabricator.wikimedia.org/T94792) (owner: 10BryanDavis) [19:07:34] PROBLEM - puppet last run on prometheus1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:14:46] (03CR) 10Andrew Bogott: [C: 032] mariadb: Decouple mariadb wikitech role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342064 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [19:18:47] (03PS3) 10Dzahn: mediawiki::logging: remove fluorine from firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/341940 (https://phabricator.wikimedia.org/T159996) [19:19:15] (03CR) 10Andrew Bogott: [C: 031] Harmomise group type for LDAP admin access [puppet] - 10https://gerrit.wikimedia.org/r/342008 (https://phabricator.wikimedia.org/T157131) (owner: 10Muehlenhoff) [19:19:45] (03CR) 10Andrew Bogott: [C: 032] labstore: fix typo in snapshot-manager [puppet] - 10https://gerrit.wikimedia.org/r/341427 (owner: 10Hashar) [19:19:49] (03PS2) 10Andrew Bogott: labstore: fix typo in snapshot-manager [puppet] - 10https://gerrit.wikimedia.org/r/341427 (owner: 10Hashar) [19:20:23] 06Operations, 10hardware-requests, 13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3088814 (10Dzahn) a:03Dzahn [19:20:56] (03PS1) 10Filippo Giunchedi: Provision new ms-be machines in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342074 (https://phabricator.wikimedia.org/T158337) [19:21:42] (03CR) 10Dzahn: [V: 032 C: 032] mediawiki::logging: remove fluorine from firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/341940 (https://phabricator.wikimedia.org/T159996) (owner: 10Dzahn) [19:22:18] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/341434 (owner: 10Dzahn) [19:22:43] !log foreachwiki extensions/WikimediaMaintenance/createExtensionTables.php linter [19:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:49] (03PS1) 10Eevans: Optional Cassandra client encryption; Enabled on RESTBase Staging [puppet] - 10https://gerrit.wikimedia.org/r/342075 (https://phabricator.wikimedia.org/T111113) [19:22:51] (03CR) 10Dzahn: 
"https://gerrit.wikimedia.org/r/#/c/341427/ has been merged (thx AndrewBogott) so now this should be Verified" [puppet] - 10https://gerrit.wikimedia.org/r/341434 (owner: 10Dzahn) [19:24:10] (03CR) 10Dzahn: "on eventlog1001:" [puppet] - 10https://gerrit.wikimedia.org/r/341940 (https://phabricator.wikimedia.org/T159996) (owner: 10Dzahn) [19:24:18] (03PS3) 10Andrew Bogott: labstore: fix typo in snapshot-manager [puppet] - 10https://gerrit.wikimedia.org/r/341427 (owner: 10Hashar) [19:25:39] (03CR) 10jerkins-bot: [V: 04-1] typos: add 'criticial' [puppet] - 10https://gerrit.wikimedia.org/r/341434 (owner: 10Dzahn) [19:26:12] 06Operations, 10ops-codfw, 10hardware-requests: decommission ms2001 & ms2002 - https://phabricator.wikimedia.org/T157991#3088824 (10RobH) 05Open>03Resolved port info is no longer on switches, resolving task [19:26:47] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3088826 (10RobH) a:05Papaul>03RobH [19:29:18] 06Operations, 10hardware-requests, 13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3088834 (10Dzahn) [19:31:16] (03PS2) 10Eevans: Optional Cassandra client encryption; Enabled on RESTBase Staging [puppet] - 10https://gerrit.wikimedia.org/r/342075 (https://phabricator.wikimedia.org/T111113) [19:31:18] 06Operations, 10Packaging: Upgrade php5-json .deb to at least 1.3.8 - https://phabricator.wikimedia.org/T160101#3088838 (10Legoktm) [19:33:34] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 453.41 seconds [19:35:30] (03CR) 10Eevans: "PC output: http://puppet-compiler.wmflabs.org/5711/" [puppet] - 10https://gerrit.wikimedia.org/r/342075 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [19:35:34] RECOVERY - puppet last run on prometheus1002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:35:51] 06Operations, 10hardware-requests, 
13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3088868 (10Dzahn) [19:36:21] (03CR) 10Dzahn: [C: 032] remove fluorine from DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/341939 (https://phabricator.wikimedia.org/T159996) (owner: 10Dzahn) [19:36:31] (03PS2) 10Dzahn: remove fluorine from DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/341939 (https://phabricator.wikimedia.org/T159996) [19:43:10] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/341434 (owner: 10Dzahn) [19:44:34] PROBLEM - puppet last run on wtp1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:47:37] (03CR) 10Dzahn: [C: 032] typos: add 'criticial' [puppet] - 10https://gerrit.wikimedia.org/r/341434 (owner: 10Dzahn) [19:47:42] (03PS3) 10Dzahn: typos: add 'criticial' [puppet] - 10https://gerrit.wikimedia.org/r/341434 [19:47:51] (03CR) 10Dzahn: [V: 032 C: 032] typos: add 'criticial' [puppet] - 10https://gerrit.wikimedia.org/r/341434 (owner: 10Dzahn) [19:49:01] (03PS3) 10Dzahn: udp2log: fix lint warning [puppet] - 10https://gerrit.wikimedia.org/r/341964 [19:52:58] (03PS4) 10Dzahn: udp2log: remove "lint-ignore" that has been fixed [puppet] - 10https://gerrit.wikimedia.org/r/341964 [19:53:32] !log reedy@tin Synchronized php-1.29.0-wmf.15/extensions/ConfirmEdit: Fixup maintenance script (duration: 00m 43s) [19:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:25] (03CR) 10Dzahn: [C: 032] udp2log: remove "lint-ignore" that has been fixed [puppet] - 10https://gerrit.wikimedia.org/r/341964 (owner: 10Dzahn) [19:55:51] (03PS2) 10Dzahn: Gerrit: Remove reviewer counts cron, nobody is using it [puppet] - 10https://gerrit.wikimedia.org/r/341593 (owner: 10Chad) [19:56:34] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:58:22] (03CR) 10Dzahn: [C: 032] Gerrit: Remove reviewer counts cron, nobody is using it [puppet] - 10https://gerrit.wikimedia.org/r/341593 (owner: 10Chad) [20:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170309T2000). Please do the needful. [20:02:57] !log cobalt: remove crontab entry of user gerrit2 that created reviewer counts, gzip /var/www/reviewer-counts.json and moved to /root/ for backup (re: gerrit:341592) T54329 [20:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:04] T54329: Provide reviewer counts per Gerrit changeset in batch form - https://phabricator.wikimedia.org/T54329 [20:03:43] (03CR) 10Dzahn: "!log cobalt: remove crontab entry of user gerrit2 that created reviewer counts, gzip /var/www/reviewer-counts.json and moved to /root/ for" [puppet] - 10https://gerrit.wikimedia.org/r/341593 (owner: 10Chad) [20:04:37] (03PS4) 10Dzahn: planet: get rid of $realm-case, use Hiera for domain name [puppet] - 10https://gerrit.wikimedia.org/r/341959 [20:06:29] (03PS1) 1020after4: all wikis to 1.29.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342078 [20:06:31] (03CR) 1020after4: [C: 032] all wikis to 1.29.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342078 (owner: 1020after4) [20:08:43] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342078 (owner: 1020after4) [20:08:56] (03CR) 10jenkins-bot: all wikis to 1.29.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342078 (owner: 1020after4) [20:09:13] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.15 [20:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:34] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently 
enabled, last run 40 seconds ago with 0 failures [20:18:32] (03CR) 10Krinkle: "No worries, I appreciate it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338996 (https://phabricator.wikimedia.org/T158580) (owner: 10Jcrespo) [20:22:47] (03PS1) 10Papaul: DNS:Add production dns for oresrdb2002 [dns] - 10https://gerrit.wikimedia.org/r/342081 [20:23:02] (03CR) 10jerkins-bot: [V: 04-1] DNS:Add production dns for oresrdb2002 [dns] - 10https://gerrit.wikimedia.org/r/342081 (owner: 10Papaul) [20:24:34] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:24:46] (03PS19) 10Nuria: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [20:25:23] (03Draft1) 10Paladox: Gerrit: Make sure any services under the gerrit2 user are stopped [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 [20:25:25] (03PS2) 10Paladox: Gerrit: Make sure any services under the gerrit2 user are stopped [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 [20:25:33] RainbowSprinkles ^^ :) [20:26:02] (03CR) 10jerkins-bot: [V: 04-1] Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [20:27:40] (03CR) 10Krinkle: [C: 04-1] "LGTM (assuming tests pass) - two minor nits left." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [20:28:52] (03PS3) 10Paladox: Gerrit: Make sure any services under the gerrit2 user are stopped [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 [20:29:50] (03CR) 10Chad: Gerrit: Make sure any services under the gerrit2 user are stopped (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox) [20:30:37] (03CR) 10Paladox: Gerrit: Make sure any services under the gerrit2 user are stopped (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox) [20:32:27] (03CR) 10Paladox: "killall is dangerous as it fails lintian." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox) [20:32:30] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5712/" [puppet] - 10https://gerrit.wikimedia.org/r/341959 (owner: 10Dzahn) [20:36:18] (03PS20) 10Nuria: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [20:37:04] (03CR) 10Chad: Gerrit: Make sure any services under the gerrit2 user are stopped (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox) [20:38:56] What would be the most elegant way to specify an associative array in heira, and then template that into a yaml file? Do we have an example of such a thing in operations/puppet? [20:43:44] urandom: we have lots of associative arrays in our hieradata, all over the hieradata/ subdirectories in the ops/puppet repo [20:44:11] foo::bar: [20:44:15] (03PS4) 10Paladox: Gerrit: Make sure any services under the gerrit2 user are stopped [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 [20:44:16] key1: value1 [20:44:18] (03CR) 10Krinkle: [C: 031] "'python -m unittest navtiming' passes." 
[puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [20:44:19] key2: value2 [20:44:20] etc... [20:45:00] oh I misunderstood your question I think [20:45:13] you want the (yaml) hieradata to then be ERB-templated into a yaml output file on the host [20:45:26] I think that depends on whether it's just a single-depth array of simple keys or not [20:45:34] (if it is, just iterate it in ruby and output yaml-looking strings) [20:45:40] (if not... ?) [20:45:48] (03PS3) 10Dzahn: racktables: get rid of $realm-case, use Hiera for host name [puppet] - 10https://gerrit.wikimedia.org/r/341960 [20:48:00] (03CR) 10Dzahn: "yea, i recommended killall first because it can kill by process name to avoid the scripting to find the UID, then lintian said that is dan" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox) [20:49:01] bblack: yeah, it's single-depth (for now) [20:49:15] or single-depth would be Good Enough I think [20:50:01] (03CR) 10Ottomata: [C: 031] "A follow up here would be to use Kafka instead of ZMQ. We can do that after we get webperf eventlogging off of trebuchet and onto scap." 
[puppet] - 10https://gerrit.wikimedia.org/r/341724 (owner: 10Krinkle) [20:50:10] (03CR) 10Dzahn: "generally about kill i always had this quote: "Generally, send 15, and wait a second or two, and if that doesn't work, send 2, and if that" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox) [20:50:54] (03PS1) 10Papaul: Add MAC address and partman entries for oresrdb2002 [puppet] - 10https://gerrit.wikimedia.org/r/342086 [20:50:56] bblack: i was hoping for something more elegant, but in this case, didn't have strong expectations :) [20:52:05] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5713/" [puppet] - 10https://gerrit.wikimedia.org/r/341960 (owner: 10Dzahn) [20:53:19] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3089206 (10Papaul) [20:54:18] 06Operations, 10RESTBase, 10service-runner, 06Services (doing), 15User-mobrovac: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3089212 (10mobrovac) Ok, after losing half of a day on this, I realised that using `/var/log` is not going to fly with firejail. It explicitly [... 
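A minimal sketch of the approach discussed above between 20:38 and 20:50 (a single-depth associative array from hiera, rendered into a YAML file through ERB). It is not taken from operations/puppet: the `settings` hash and its keys are hypothetical stand-ins for whatever a hiera lookup would return, and in a real Puppet template the data would normally be reached as an instance variable such as `@settings` rather than a local. The first variant iterates in Ruby and prints YAML-looking strings, as suggested in the chat; the second hands serialization to Ruby's standard YAML library, which may be the "more elegant" option since it also covers quoting and nested values.

```ruby
require 'erb'
require 'yaml'

# Hypothetical single-depth hash, standing in for the result of a hiera lookup.
settings = { 'key1' => 'value1', 'key2' => 'value2' }

# Variant 1: iterate in ERB and emit "key: value" lines by hand.
# trim_mode '-' (keyword form, Ruby 2.6+) drops the newline after "-%>" tags.
# Sorting keeps the output stable between runs.
template = ERB.new(<<~TPL, trim_mode: '-')
  <% settings.sort.each do |key, value| -%>
  <%= key %>: <%= value %>
  <% end -%>
TPL
iterated = template.result(binding)

# Variant 2: let the YAML library serialize the hash, including any quoting.
dumped = settings.to_yaml

puts iterated
puts dumped
```

The hand-iterated variant is only safe while keys and values are plain strings without YAML-special characters (colons, leading `*` or `&`, and so on); once that stops being true, the `to_yaml` variant is the less fragile choice.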
[20:55:34] (03PS2) 10Dzahn: Add MAC address and partman entries for oresrdb2002 [puppet] - 10https://gerrit.wikimedia.org/r/342086 (owner: 10Papaul) [20:58:03] (03CR) 10Dzahn: [C: 032] Add MAC address and partman entries for oresrdb2002 [puppet] - 10https://gerrit.wikimedia.org/r/342086 (owner: 10Papaul) [20:59:56] (03CR) 10Dzahn: [C: 04-1] "in "wmnet" the IP is not complete" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/342081 (owner: 10Papaul) [21:03:41] 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3089287 (10Paladox) [21:03:45] 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3089299 (10Paladox) p:05Triage>03High [21:04:44] (03PS2) 10Dzahn: maintenance: provision /etc/wgetrc [puppet] - 10https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: 10Dereckson) [21:12:18] 06Operations, 07Documentation, 07LDAP, 13Patch-For-Review: Review list of LDAP groups and document exactly what kind of access they can be allowed to provide - https://phabricator.wikimedia.org/T129788#3089322 (10Dzahn) also see: T160122 [21:13:33] (03PS1) 10Eevans: WIP: TLS configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 [21:15:05] (03CR) 10jerkins-bot: [V: 04-1] WIP: TLS configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 (owner: 10Eevans) [21:16:31] 06Operations, 10RESTBase, 10service-runner, 06Services (doing), 15User-mobrovac: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3089340 (10GWicke) How about going back to syslog over UDP? [21:16:45] (03CR) 10Dzahn: "could you make wgetrc a template instead of a file and use "<%= @site %>" instead of "eqiad" in there to make it flexible?" 
[puppet] - 10https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: 10Dereckson) [21:17:33] 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3089347 (10Paladox) [21:17:39] (03PS2) 10Eevans: WIP: TLS configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 [21:23:42] 06Operations, 10hardware-requests, 13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3089366 (10Dzahn) [21:29:37] (03PS1) 10Dzahn: site.pp: remove fluorine (decom) [puppet] - 10https://gerrit.wikimedia.org/r/342089 (https://phabricator.wikimedia.org/T159996) [21:29:44] (03PS3) 10Eevans: WIP: TLS configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 [21:30:19] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5716/" [puppet] - 10https://gerrit.wikimedia.org/r/341966 (owner: 10Dzahn) [21:30:24] (03PS3) 10Dzahn: authdns: fix lint warning [puppet] - 10https://gerrit.wikimedia.org/r/341966 [21:33:31] (03CR) 10Mobrovac: WIP: TLS configuration for RESTBase (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/342088 (owner: 10Eevans) [21:33:39] (03Draft1) 10Paladox: Gerrit: Ensure review_site is owned by gerrit2:gerrit [puppet] - 10https://gerrit.wikimedia.org/r/342091 [21:33:41] (03CR) 10Dzahn: [C: 032] site.pp: remove fluorine (decom) [puppet] - 10https://gerrit.wikimedia.org/r/342089 (https://phabricator.wikimedia.org/T159996) (owner: 10Dzahn) [21:33:49] (03PS2) 10Paladox: Gerrit: Ensure review_site is owned by gerrit2:gerrit [puppet] - 10https://gerrit.wikimedia.org/r/342091 [21:35:51] !log fluorine - shutdown -h now (decom) T159996 [21:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:58] T159996: decom fluorine - https://phabricator.wikimedia.org/T159996 [21:36:34] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 
55.39 seconds [21:37:45] !log fluorine - puppet node clean, puppet node deactivate, salt-key -d, remove from Icinga.. (T159996) [21:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:58] (03PS4) 10Dzahn: authdns: fix lint warning [puppet] - 10https://gerrit.wikimedia.org/r/341966 [21:39:38] 06Operations, 10hardware-requests, 13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3089445 (10Dzahn) [21:41:39] (03CR) 10Dzahn: [C: 032] remove fluorine prod IP, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/341941 (https://phabricator.wikimedia.org/T159996) (owner: 10Dzahn) [21:42:18] (03PS3) 10Dzahn: remove fluorine prod IP, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/341941 (https://phabricator.wikimedia.org/T159996) [21:43:33] 06Operations, 10hardware-requests, 13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3089473 (10Dzahn) [21:44:07] (03PS4) 10Dzahn: remove fluorine prod IP, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/341941 (https://phabricator.wikimedia.org/T159996) [21:46:17] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3089498 (10Dzahn) [21:46:32] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#2199946 (10Dzahn) fluorine has been shutdown: count: 3 [21:48:30] (03PS3) 10Paladox: Gerrit: Ensure /var/lib/gerrit2 is owned by gerrit2:gerrit [puppet] - 10https://gerrit.wikimedia.org/r/342091 [21:48:37] (03PS4) 10Paladox: Gerrit: Ensure /var/lib/gerrit2 is owned by gerrit2:gerrit [puppet] - 10https://gerrit.wikimedia.org/r/342091 [21:48:50] mutante RainbowSprinkles ^^ [21:48:53] :) [21:48:53] (03CR) 10Andrew Bogott: [C: 032] WIP Apt: Remove an ensure->absent stanza [puppet] - 10https://gerrit.wikimedia.org/r/336954 (owner: 10Andrew Bogott) [21:51:36] (03PS3) 10Andrew Bogott: 
Apt: Remove an ensure->absent stanza [puppet] - 10https://gerrit.wikimedia.org/r/336954 [21:51:39] (03PS2) 10Papaul: DNS:Add production dns for oresrdb2002 [dns] - 10https://gerrit.wikimedia.org/r/342081 [21:52:07] (03PS4) 10Eevans: WIP: TLS configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 [21:52:12] (03CR) 10Dzahn: "i can confirm /var/lib/gerrit2/review_site/lib/mysql-connector-java.jar is owned by gerrit2:gerrit2 on cobalt. not so sure yet about runn" [puppet] - 10https://gerrit.wikimedia.org/r/342091 (owner: 10Paladox) [21:52:26] (03CR) 10Chad: [C: 04-1] "I disagree with this approach. Labs should be fixed to have the correct group, production does this just fine already" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342091 (owner: 10Paladox) [21:53:04] (03Abandoned) 10Paladox: Gerrit: Ensure /var/lib/gerrit2 is owned by gerrit2:gerrit [puppet] - 10https://gerrit.wikimedia.org/r/342091 (owner: 10Paladox) [21:54:12] (03Abandoned) 10Mholloway: [Android] Create symlink to repo licenses dir in the SDK on CI [puppet] - 10https://gerrit.wikimedia.org/r/341583 (https://phabricator.wikimedia.org/T147099) (owner: 10Mholloway) [21:54:23] 06Operations, 10hardware-requests, 13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3089520 (10Dzahn) [21:54:48] !log mobrovac@tin Started deploy [trending-edits/deploy@57a654e]: Bump max_pages for T156411 [21:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:54] T156411: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411 [21:55:41] (03CR) 10Andrew Bogott: [C: 032] Apt: Remove an ensure->absent stanza [puppet] - 10https://gerrit.wikimedia.org/r/336954 (owner: 10Andrew Bogott) [21:56:57] 06Operations, 10hardware-requests, 13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3089559 (10Dzahn) @Robh per IRC talk, all steps above done (and added a 
little) up to disabling switch ports. please do that as i just shut the server down a couple minutes ago. please **do NO... [21:57:07] 06Operations, 10hardware-requests, 13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3089560 (10Dzahn) a:05Dzahn>03RobH [21:59:24] 06Operations, 06Performance-Team: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3089575 (10Krinkle) p:05Triage>03Low [22:00:32] (03CR) 10Mobrovac: [C: 04-1] WIP: TLS configuration for RESTBase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342088 (owner: 10Eevans) [22:00:54] !log mobrovac@tin Finished deploy [trending-edits/deploy@57a654e]: Bump max_pages for T156411 (duration: 06m 07s) [22:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:01] T156411: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411 [22:05:44] 06Operations, 10RESTBase, 10service-runner, 06Services (doing), 15User-mobrovac: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3089589 (10mobrovac) Touche. I vote for the latter. 
[22:07:01] (03PS3) 10Dzahn: DNS:Add production dns for oresrdb2002 [dns] - 10https://gerrit.wikimedia.org/r/342081 (owner: 10Papaul) [22:09:04] (03CR) 10Dzahn: [C: 032] DNS:Add production dns for oresrdb2002 [dns] - 10https://gerrit.wikimedia.org/r/342081 (owner: 10Papaul) [22:10:02] (03PS1) 10Mobrovac: RESTBase: Send the logs locally to stdout/syslog [puppet] - 10https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) [22:11:29] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1045 MB (3% inode=51%) [22:11:59] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: / 40 MB (0% inode=51%) [22:13:49] (03CR) 10Ppchelko: RESTBase: Send the logs locally to stdout/syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) (owner: 10Mobrovac) [22:14:01] (03PS5) 10Eevans: WIP: TLS configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 [22:14:54] uh oh @ prometheus disk space [22:15:30] (03PS2) 10Mobrovac: RESTBase: Send the logs locally to stdout/syslog [puppet] - 10https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) [22:16:17] (03CR) 10Mobrovac: RESTBase: Send the logs locally to stdout/syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) (owner: 10Mobrovac) [22:17:42] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Segment Navigation Timing data by continent - https://phabricator.wikimedia.org/T128709#3089607 (10Krinkle) [22:18:32] !log maxsem@tin Started deploy [tilerator/deploy@367df80]: no-op [22:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:53] !log maxsem@tin Finished deploy [tilerator/deploy@367df80]: no-op (duration: 00m 22s) [22:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:11] (03PS6) 10Eevans: WIP: TLS 
configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 [22:24:05] (03PS7) 10Eevans: WIP: TLS configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 [22:28:29] 06Operations, 06Performance-Team, 10Wikimedia-General-or-Unknown: Run EventLogging test to determine best DC for each country - https://phabricator.wikimedia.org/T55497#3089642 (10Krinkle) [22:28:59] (03PS1) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) [22:30:00] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/5724/restbase1009.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) (owner: 10Mobrovac) [22:30:05] (03PS8) 10Eevans: WIP: TLS configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 [22:30:27] (03CR) 10Mobrovac: "Also confirmed in BC that the output ends up in the syslog.log file." [puppet] - 10https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) (owner: 10Mobrovac) [22:33:26] (03CR) 10GWicke: "See inline question about log levels for stdout." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) (owner: 10Mobrovac) [22:35:53] (03PS3) 10Mobrovac: RESTBase: Send the logs locally to stdout/syslog [puppet] - 10https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) [22:37:15] (03CR) 10Mobrovac: RESTBase: Send the logs locally to stdout/syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) (owner: 10Mobrovac) [22:40:36] (03PS9) 10Eevans: Cassanra TLS configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) [22:43:13] (03CR) 10Eevans: "PC output: http://puppet-compiler.wmflabs.org/5726/" [puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [22:44:00] (03CR) 10Eevans: [C: 04-1] "This isn't ready to be merged; Depend on https://gerrit.wikimedia.org/r/342075" [puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [22:45:46] !log maxsem@tin Started deploy [tilerator/deploy@fb06c99]: https://gerrit.wikimedia.org/r/#/c/342140/ [22:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:07] !log maxsem@tin Finished deploy [tilerator/deploy@fb06c99]: https://gerrit.wikimedia.org/r/#/c/342140/ (duration: 00m 21s) [22:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:39] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:46:41] !log prometheus1003 - stopping service: [....] 
Stopping monitoring system and time series database: prometheusInvalid --pidfile argument: '/var/run/prometheus/prometheus.pid' (Parent directory does not exist) [22:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:06] (03CR) 10GWicke: RESTBase: Send the logs locally to stdout/syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) (owner: 10Mobrovac) [22:48:57] !log maxsem@tin Started deploy [tilerator/deploy@fb06c99]: https://gerrit.wikimedia.org/r/#/c/342140/ [22:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:03] !log maxsem@tin Finished deploy [tilerator/deploy@fb06c99]: https://gerrit.wikimedia.org/r/#/c/342140/ (duration: 00m 05s) [22:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:27] !log prometheus1003/1004 - systemctl stop prometheus (as opposed to /etc/init.d/prometheus), as they are low on disk but are not in production yet [22:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:58] (03CR) 10Krinkle: "Is there a task for the reaper and/or the issue it solves? Would be good to have a short write-up about the data we got from labs (what's " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339245 (owner: 10Aaron Schulz) [22:55:23] (03PS6) 10Krinkle: Include DB shard in production SPI log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331808 (owner: 10Aaron Schulz) [23:04:59] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:08:39] 06Operations, 10hardware-requests, 13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3089733 (10RobH) [23:09:30] 06Operations, 10hardware-requests, 13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3085886 (10RobH) a:05RobH>03Dzahn The switch port is disabled. 
Once you have confirmed this wipe can occur, please comment and assign this to @cmjohnson. [23:14:59] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:15:39] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [23:37:14] (03Abandoned) 10Dzahn: prometheus: fix lint warning [puppet] - 10https://gerrit.wikimedia.org/r/341965 (owner: 10Dzahn) [23:42:23] (03PS5) 10Dzahn: change MX records for wikimedia.ee from elkdata.ee to Google [dns] - 10https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) [23:52:01] (03PS1) 10Krinkle: (no-op) Move NavigationTiming config to EventLogging section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342147 [23:52:03] (03PS1) 10Krinkle: Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) [23:53:09] (03PS1) 10Krinkle: (no-op) Remove setting of unused $wgPercentHHVM (no longer exists) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342149 [23:53:52] (03PS2) 10Krinkle: [noop] Move NavigationTiming config to EventLogging section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342147 [23:54:01] (03PS2) 10Krinkle: (no-op) Remove setting of unused $wgPercentHHVM (no longer exists) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342149 [23:54:39] PROBLEM - tileratorui on maps-test2001 is CRITICAL: connect to address 10.192.0.128 and port 6535: Connection refused [23:54:39] PROBLEM - tilerator on maps-test2001 is CRITICAL: connect to address 10.192.0.128 and port 6534: Connection refused [23:55:03] MaxSem: ^ ? 
[23:58:53] (03PS3) 10Krinkle: [noop] Move NavigationTiming config to EventLogging section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342147 [23:59:04] (03PS3) 10Krinkle: (no-op) Remove setting of unused $wgPercentHHVM (no longer exists) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342149 [23:59:05] 06Operations, 10media-storage: Sanity check global-multiwrite logs for ConfirmEdit usage - https://phabricator.wikimedia.org/T159830#3089950 (10Reedy) Well, the deletions should've been happening, but it was weird that one file was left. It seems there were some issues getting the captchas stored for whatever...