[00:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180208T0000). [00:00:04] James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:09] CUSTOM - Long running screen/tmux on bast4002 is CRITICAL: CRIT: Long running SCREEN process. (PID: 15761, 1960243s 1728000s). [00:00:14] sends CUSTOM message [00:01:28] ACKNOWLEDGEMENT - configured eth on labtestnet2002 is CRITICAL: eth1 reporting no carrier. daniel_zahn TEST USING LABTESTNET [00:01:53] Heya. [00:02:10] guess it's fixed. but waiting for a "natural CRIT" [00:02:41] all of ORES broken seems known .. [00:04:27] PROBLEM - HHVM rendering on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:05:17] though i can just assume that from disabled notifications, could be forgotten from last time, ACKs wouldnt have that issue [00:05:17] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 75040 bytes in 0.375 second response time [00:05:30] ok, icinga-wm looks normal [00:06:19] (03PS1) 10BBlack: eqsin: smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/408952 (https://phabricator.wikimedia.org/T156027) [00:06:51] (03CR) 10BBlack: [C: 032] eqsin: smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/408952 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [00:09:10] 10Operations, 10Traffic, 10Patch-For-Review: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#3954594 (10BBlack) Copying from earlier commitlog commentary, known list of TODOs here (minus what's already been done since above): ``` * hieradata/common/cache/*.yaml: eqsin 
node l... [00:09:37] (03PS5) 10Dzahn: phabricator: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/408947 [00:12:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10User-ArielGlenn: Move labstore1006 and 1007 to 10G enabled racks in row A & D - https://phabricator.wikimedia.org/T186756#3954600 (10madhuvishy) [00:17:54] i guess noone is doing swat? i can then, but i can't stay too long [00:18:15] thankfully it looks like all wmf.20 which should be safe [00:19:41] James_F: ^ [00:20:22] ebernhardson: Thanks. :-) [00:20:44] wmf.20 is on basically no wikis [00:20:46] It's wmf.17 [00:20:59] (so wmf.20 syncs have little effect) [00:21:02] so, deploy and run :) [00:21:45] Better to fix known issues in wmf.20 now, then. :-) [00:22:57] * greg-g side eyes at ebernhardson [00:28:11] ebernhardson: They finally merged. [00:30:12] James_F: anything to test? [00:30:36] ebernhardson: Yeah, sling it up briefly? [00:31:42] James_F: it's pulling to mwdebug1001. But going slower than usual ... [00:32:03] it says "finished rsync common" and it now just waiting for ?? [00:32:59] !log ebernhardson@tin Synchronized php-1.31.0-wmf.20/extensions/CirrusSearch/: T186765: Add special handling for profiles into config dump (duration: 01m 27s) [00:33:02] ebernhardson: … well, it worked. [00:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:12] T186765: Cross-wiki search profile issues - https://phabricator.wikimedia.org/T186765 [00:33:25] James_F: and now it finally returned! i wonder what it was waiting on ... syncing out to the rest of the cluster [00:33:43] ebernhardson: Both patches tested and working. Sync away. 
[00:35:10] !log ebernhardson@tin Synchronized php-1.31.0-wmf.20/extensions/VisualEditor/: Revert "Use wgEditSubmitButtonLabelPublish from upstream", Assume wpTextbox1 has an API registered already (duration: 01m 12s) [00:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:30] ebernhardson: Was that you syncing both? [00:37:38] James_F: yup that's both VE patches [00:38:03] ebernhardson: Awesome. Thank you! [00:52:33] 10Operations, 10Page-Previews, 10RESTBase, 10Traffic, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#3954646 (10Pchelolo) I've done a little bit more research here and Varnish docs actually confirm that `age` header can effectively disallow the clien... [00:56:38] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#3954661 (10Pchelolo) [01:00:04] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180208T0100). [01:00:04] No GERRIT patches in the queue for this window AFAICS. [01:00:50] (03CR) 10VolkerE: "Nor of caution: Depending on the SVG export you could end up with more blurred versions of the icons (haven't checked them 1:1)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402618 (https://phabricator.wikimedia.org/T177726) (owner: 10Odder) [01:16:34] 10Operations, 10Page-Previews, 10RESTBase, 10Traffic, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#3954704 (10BBlack) I don't know if that sounds like quite the right answer, I think this needs more thinking/info about what behaviors we're trying t... 
[01:24:24] (03PS1) 10Andrew Bogott: labweb: use a different location for static horizon content [puppet] - 10https://gerrit.wikimedia.org/r/408962 [01:26:01] (03CR) 10Andrew Bogott: [C: 032] labweb: use a different location for static horizon content [puppet] - 10https://gerrit.wikimedia.org/r/408962 (owner: 10Andrew Bogott) [01:29:23] !log andrew@tin Started deploy [horizon/deploy@9223ba7]: Now with static content, I hope [01:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:28] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:30:38] !log andrew@tin Finished deploy [horizon/deploy@9223ba7]: Now with static content, I hope (duration: 01m 15s) [01:30:40] !log andrew@tin Started deploy [horizon/deploy@9223ba7]: Now with static content, I hope [01:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:28] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:27] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 75164 bytes in 0.292 second response time [01:55:27] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:59:05] (03PS6) 10Dzahn: phabricator: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/408947 [01:59:39] (03PS1) 10Andrew Bogott: Add another rule to horizon/queens/designate policy.json [puppet] - 10https://gerrit.wikimedia.org/r/408963 [02:00:45] (03CR) 10Andrew Bogott: [C: 032] Add another rule to horizon/queens/designate policy.json [puppet] - 10https://gerrit.wikimedia.org/r/408963 (owner: 10Andrew Bogott) [02:04:36] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/9893/phab1001.eqiad.wmnet/" [puppet] - 
10https://gerrit.wikimedia.org/r/408947 (owner: 10Dzahn) [02:05:32] (03CR) 10Dzahn: [C: 031] "wmf-style: total violations delta -6" [puppet] - 10https://gerrit.wikimedia.org/r/408947 (owner: 10Dzahn) [02:26:33] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.17) (duration: 05m 58s) [02:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:57] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 896.72 seconds [03:35:07] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [03:35:47] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [03:41:27] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [03:41:37] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [03:56:58] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 255.74 seconds [04:21:37] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [04:22:07] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [04:29:18] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [04:29:27] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [04:30:37] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [04:31:27] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [04:37:07] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [04:37:27] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [04:59:07] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [04:59:17] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [05:06:57] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [05:07:07] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [05:08:17] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [05:09:17] RECOVERY - 
Host google is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [05:36:10] google? [07:12:47] (03PS2) 10Elukey: profile::analytics::refinery::job::json_refine: standardize netflow conf [puppet] - 10https://gerrit.wikimedia.org/r/408836 (https://phabricator.wikimedia.org/T181036) [07:13:52] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::json_refine: standardize netflow conf [puppet] - 10https://gerrit.wikimedia.org/r/408836 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [07:17:14] 10Operations, 10ops-eqiad, 10hardware-requests: decom fluorine - https://phabricator.wikimedia.org/T159996#3954868 (10Krinkle) [07:27:33] <_joe_> !log depooled mw1256 from traffic, scap (faulty disk, T186535); now powering it off [07:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:46] T186535: OfflineUncorrectableSector on mw1256 sda - https://phabricator.wikimedia.org/T186535 [07:27:58] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3954893 (10Krinkle) [07:31:44] (03PS1) 10Marostegui: s4.hosts: Remove blank line [software] - 10https://gerrit.wikimedia.org/r/408975 [07:33:16] (03CR) 10Marostegui: [C: 032] s4.hosts: Remove blank line [software] - 10https://gerrit.wikimedia.org/r/408975 (owner: 10Marostegui) [07:34:10] (03Merged) 10jenkins-bot: s4.hosts: Remove blank line [software] - 10https://gerrit.wikimedia.org/r/408975 (owner: 10Marostegui) [08:07:42] !log upgrade remaining app servers to HHVM 3.18.7 [08:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:49] hello people, going to try again to copy data from two partitions on meitnerium [08:33:00] the last time ganeti1005 did not like it [08:43:19] (03PS2) 10Gehel: maps: Bump maximum zoom to 19 [puppet] - 10https://gerrit.wikimedia.org/r/394948 (https://phabricator.wikimedia.org/T180907) [08:43:29] (using rsync -av --bwlimit, let's see 
how it goes) [08:57:37] (03CR) 10Gehel: [C: 032] maps: Bump maximum zoom to 19 [puppet] - 10https://gerrit.wikimedia.org/r/394948 (https://phabricator.wikimedia.org/T180907) (owner: 10Gehel) [09:03:16] (03PS2) 10Gehel: maps: backport cache header configuration from upstream [puppet] - 10https://gerrit.wikimedia.org/r/408832 (https://phabricator.wikimedia.org/T108435) [09:04:13] (03CR) 10Gehel: [C: 032] maps: backport cache header configuration from upstream [puppet] - 10https://gerrit.wikimedia.org/r/408832 (https://phabricator.wikimedia.org/T108435) (owner: 10Gehel) [09:11:31] (03PS1) 10Elukey: profile::analytics::refinery::job::json_refine: change netflow's db [puppet] - 10https://gerrit.wikimedia.org/r/408981 (https://phabricator.wikimedia.org/T181036) [09:12:46] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::json_refine: change netflow's db [puppet] - 10https://gerrit.wikimedia.org/r/408981 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [09:23:27] good luck elukey ! [09:29:30] (03CR) 10Filippo Giunchedi: "I'm assuming we'll be deleting the repo from gerrit itself, so the files should be gone anyway at that point without needed to commit a de" [software/gdash] - 10https://gerrit.wikimedia.org/r/408787 (https://phabricator.wikimedia.org/T186696) (owner: 10MarcoAurelio) [09:30:35] Morning [09:30:41] Any planned downtime? [09:30:56] I am getting very slow performance when doing a TLS handshake [09:31:30] <_joe_> ShakespeareFan00: planned downtimes of such nature would be communicated well in advance [09:32:46] <_joe_> ShakespeareFan00: if I had to guess, it's a client-side network issue. 
I say client-side as I reach the same caching DCs as you and everything works like a breeze [09:32:56] <_joe_> (unless you've changed location recently) [09:35:35] (03PS1) 10KartikMistry: Add apertium-ukr and apertium-rus-ukr packages [puppet] - 10https://gerrit.wikimedia.org/r/408986 (https://phabricator.wikimedia.org/T184901) [09:37:28] PROBLEM - DPKG on mw1275 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:37:47] PROBLEM - DPKG on mw1274 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:37:48] PROBLEM - DPKG on mw1278 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:38:04] ^that's me, incomplete downtime [09:38:28] RECOVERY - DPKG on mw1275 is OK: All packages OK [09:38:48] RECOVERY - DPKG on mw1274 is OK: All packages OK [09:38:50] RECOVERY - DPKG on mw1278 is OK: All packages OK [09:46:21] (03PS1) 10Volans: Documentation: migrate ReadTheDocs to Python3 [software/cumin] - 10https://gerrit.wikimedia.org/r/408987 [09:57:55] 10Operations, 10Maps-Sprint, 10Traffic: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3955088 (10Gehel) I'm not sure that our varnish configuration honors `stale-while-revalidate` headers. A quick look through the code shows a [[ https://github.com/wikimedia/puppet... 
[10:04:26] (03PS1) 10Ema: icinga-downtime: do not wait for two log lines [puppet] - 10https://gerrit.wikimedia.org/r/408989 (https://phabricator.wikimedia.org/T145192) [10:07:38] !log upgrading remaining nginx-full packages on mw* in eqiad to 1.13.6-2+wmf1~jessie1 [10:07:49] !log Drop deleted databases from sanitarium and labsdb hosts - T186685 [10:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:03] T186685: Remove deleted wikis from wikireplicas - https://phabricator.wikimedia.org/T186685 [10:09:48] mutante: trying to add you as a reviewer for https://gerrit.wikimedia.org/r/c/408989/ but polygerrit's UI doesn't seem to find dzahn [10:11:22] lol [10:15:57] (03CR) 10Giuseppe Lavagetto: Refactor conftool.action, add the edit action (038 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/405303 (owner: 10Giuseppe Lavagetto) [10:17:29] (03CR) 10Volans: [C: 032] Documentation: migrate ReadTheDocs to Python3 [software/cumin] - 10https://gerrit.wikimedia.org/r/408987 (owner: 10Volans) [10:19:50] (03Merged) 10jenkins-bot: Documentation: migrate ReadTheDocs to Python3 [software/cumin] - 10https://gerrit.wikimedia.org/r/408987 (owner: 10Volans) [10:20:06] (03CR) 10jenkins-bot: Documentation: migrate ReadTheDocs to Python3 [software/cumin] - 10https://gerrit.wikimedia.org/r/408987 (owner: 10Volans) [10:24:16] (03PS2) 10Filippo Giunchedi: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (https://phabricator.wikimedia.org/T186534) (owner: 10Muehlenhoff) [10:24:28] (03CR) 10Volans: "I'm not sure is the best mid/long term solution here, open for discussion though." 
[puppet] - 10https://gerrit.wikimedia.org/r/408989 (https://phabricator.wikimedia.org/T145192) (owner: 10Ema) [10:25:26] 10Operations, 10ops-eqiad, 10Patch-For-Review: Offline uncorrectable sectors on poolcounter1002 /dev/sda - https://phabricator.wikimedia.org/T186534#3955127 (10fgiunchedi) @Cmjohnson ok! I'll merge https://gerrit.wikimedia.org/r/404967 early next week to depool the machine and let you know. [10:25:43] hashar: over to you, sir ;) https://gerrit.wikimedia.org/r/c/408988/ [10:25:52] (03CR) 10jerkins-bot: [V: 04-1] Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (https://phabricator.wikimedia.org/T186534) (owner: 10Muehlenhoff) [10:27:49] (03PS3) 10Filippo Giunchedi: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (https://phabricator.wikimedia.org/T186534) (owner: 10Muehlenhoff) [10:30:10] (03CR) 10jerkins-bot: [V: 04-1] Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (https://phabricator.wikimedia.org/T186534) (owner: 10Muehlenhoff) [10:33:14] (03PS4) 10Filippo Giunchedi: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (https://phabricator.wikimedia.org/T186534) (owner: 10Muehlenhoff) [10:35:26] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (https://phabricator.wikimedia.org/T186534) (owner: 10Muehlenhoff) [10:38:09] volans: deploying it :) [10:38:16] thanks! [10:38:46] that applies to both jobs right? commit publish and tag publish [10:39:07] done [10:39:14] probably yes? 
:) [10:39:26] that refreshed cumin-tox-publish cumin-tox-tag-publish [10:39:52] <_joe_> /win 25 [10:40:03] makes sense, perfect [10:42:47] ok I think that I made ganeti1005 crash [10:43:06] I tried multiple rsync --bwlimit [10:43:16] up to 10MB/s everything is fine [10:43:28] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:43:30] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:43:30] PROBLEM - Host kubestagetcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [10:43:30] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100% [10:43:30] PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:43:30] PROBLEM - Host meitnerium is DOWN: PING CRITICAL - Packet loss = 100% [10:43:33] yeah there you go [10:43:37] akosiaris: --^ [10:43:48] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100% [10:43:48] PROBLEM - Host ununpentium is DOWN: PING CRITICAL - Packet loss = 100% [10:43:57] PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100% [10:43:58] :( [10:44:02] need help? [10:44:17] PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:44:22] it should auto-recover afaik, going to check [10:44:46] this was triggered by a rsync -av dir1 dir2 without --bwlimit [10:45:31] we should also check lvses after etcd1003 will be back (cc. ema ) [10:45:56] aha [10:46:02] that's not a pybal etcd node volans [10:46:05] no? [10:46:15] yeah it's kubernetes [10:46:20] I am a task open to rename it [10:46:21] sorry I always mix them up [10:46:22] my bad [10:46:27] I have* [10:46:46] mailman doesn't work for me, is there maintenance ongoing?
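The crash above was reported as coming from an `rsync -av dir1 dir2` run with no bandwidth cap, while capped runs of up to 10MB/s were fine. A minimal sketch of building the capped invocation follows. `build_rsync_cmd` and its default limit are hypothetical, `dir1` and `dir2` are the placeholders used in the log, and the only rsync option relied on beyond `-av` is `--bwlimit`, where a plain number is read as roughly kilobytes per second.

```python
import shlex


def build_rsync_cmd(src, dst, limit_kbps=10_000):
    """Return an rsync argument list with a bandwidth cap.

    limit_kbps is an illustrative default (about 10MB/s, the highest
    rate reported as safe in the log), not a recommended value.
    """
    # rsync treats --bwlimit=0 as "no limit", which is the case that
    # triggered the crash, so refuse zero and negative values here.
    if limit_kbps <= 0:
        raise ValueError("limit_kbps must be positive")
    return ["rsync", "-av", f"--bwlimit={limit_kbps}", src, dst]


if __name__ == "__main__":
    # Print the command for review rather than executing it.
    print(shlex.join(build_rsync_cmd("dir1", "dir2")))
```

Returning a list instead of a shell string keeps the command safe to hand to `subprocess.run` later without quoting issues.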
[10:46:49] the wrong name is confNNNN IMHO, not etcd :-P [10:46:56] * ema dances [10:47:06] jynus: look a bit further up, fermium is in the list [10:47:17] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 1332295 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:47:48] PROBLEM - etcd request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 194887 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:48:03] akosiaris: sorry, didn't saw that [10:48:46] (03CR) 10Giuseppe Lavagetto: Safely load yaml files (034 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/408290 (owner: 10Giuseppe Lavagetto) [10:48:47] RECOVERY - etcd request latencies on argon is OK: OK - etcd_request_latencies is 27405 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:49:17] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 40697 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:49:18] ok at least we finally have a way to reproduce this [10:49:44] didn't things on codfw started failing after some mysql work was ongoing? [10:49:44] I am wondering now if it's the read or the write activity (or both). Easy enough to figure out I guess [10:50:02] which could signify io activity, too [10:50:42] I'm wondering if it's related to DRBD, that's what 1005-1008 sets apart from the servers where we don [10:50:47] t see any crashes [10:51:10] hashar: is there any magic work I can use on Gerrit to re-trigger the post-merge build? 
(or I just wait the next merge ;) ) [10:51:27] +1 to that guess, bad experience in the past with DRBD [10:51:34] s/work/word/ [10:51:47] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 1.83 ms [10:51:49] RECOVERY - Host etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [10:52:17] RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [10:54:06] the rest of the hosts still not back up [10:54:58] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/smartmontools/run.d/20logger] [10:55:24] ugh, I'll take a look at that [10:55:38] RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [10:55:47] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [10:55:48] RECOVERY - Host ununpentium is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [10:55:57] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [10:55:57] RECOVERY - Host kubestagetcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [10:55:57] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [10:56:07] RECOVERY - Host meitnerium is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [10:56:27] volans: we can do it manually via a CLI on the Zuul server [10:56:33] volans: beside that no, there is no other way :D [10:56:47] PROBLEM - etcd request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 423162 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:57:03] volans: I am retriggering https://gerrit.wikimedia.org/r/#/c/408987/ [10:57:30] ok, thanks a lot hashar ! 
[10:57:47] RECOVERY - etcd request latencies on argon is OK: OK - etcd_request_latencies is 5781 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:57:49] zuul enqueue --trigger gerrit --pipeline postmerge --project operations/software/cumin --change 408987,1 [10:57:53] that is the magic command :] [10:57:59] ehehehe [10:58:07] https://integration.wikimedia.org/ci/job/cumin-tox-publish/23/console [10:58:09] * volans don't want to mess with zuul [10:58:21] akosiaris: as soon as everything is back to normal and meitnerium is ready to go let me know so I'll restart the rsync with proper bw-limit :) [10:59:01] (03CR) 10jenkins-bot: Documentation: migrate ReadTheDocs to Python3 [software/cumin] - 10https://gerrit.wikimedia.org/r/408987 (owner: 10Volans) [10:59:02] nothing urgent, it can also be tomorrow [10:59:22] volans: looks like something got generated properly https://doc.wikimedia.org/cumin/master/ [10:59:27] volans: with the version still being off because the job does not use a full git clone [10:59:57] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:59:59] !log Drop wikidata renamed tables and database from s5 eqiad hosts - T184599 [11:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:13] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [11:00:13] hashar: yeah! 
checking, the version is still wrong, but you know that already :) [11:00:27] PROBLEM - Request latencies on neon is CRITICAL: CRITICAL - apiserver_request_latencies is 7049927 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:00:27] PROBLEM - etcd request latencies on neon is CRITICAL: CRITICAL - etcd_request_latencies is 8418175 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:00:38] (03PS5) 10Ema: Varnish: swizzle TTLs by 5% [puppet] - 10https://gerrit.wikimedia.org/r/408810 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack) [11:00:56] (03CR) 10Ema: [V: 032 C: 032] Varnish: swizzle TTLs by 5% [puppet] - 10https://gerrit.wikimedia.org/r/408810 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack) [11:02:01] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#3955170 (10MoritzMuehlenhoff) [11:02:20] elukey akosiaris FWIW since page allocation stalls show up in the kernel log I think it'd be interesting setting swappiness to 1 instead of 0 and see if that's more tolerant of pressure [11:02:28] cc moritzm ^ [11:02:42] +1 [11:02:47] (03CR) 10MarcoAurelio: "As far as I know, we normally do not delete repositories, we just archive them. 
You can commit this if you wish (leaves the repo with .git" [software/gdash] - 10https://gerrit.wikimedia.org/r/408787 (https://phabricator.wikimedia.org/T186696) (owner: 10MarcoAurelio) [11:04:27] RECOVERY - Request latencies on neon is OK: OK - apiserver_request_latencies is 7997 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:04:27] RECOVERY - etcd request latencies on neon is OK: OK - etcd_request_latencies is 5365 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:04:33] (03PS1) 10Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408990 [11:04:47] (03CR) 10Marostegui: [V: 032 C: 032] db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408990 (owner: 10Marostegui) [11:05:26] (03PS2) 10Giuseppe Lavagetto: cli.tool: drop the "find" interface [software/conftool] - 10https://gerrit.wikimedia.org/r/405301 [11:05:27] (03PS2) 10Giuseppe Lavagetto: Add preemptive validation. [software/conftool] - 10https://gerrit.wikimedia.org/r/405302 (https://phabricator.wikimedia.org/T185080) [11:05:29] (03PS4) 10Giuseppe Lavagetto: Refactor conftool.action, add the edit action [software/conftool] - 10https://gerrit.wikimedia.org/r/405303 [11:05:31] (03PS2) 10Giuseppe Lavagetto: Safely load yaml files [software/conftool] - 10https://gerrit.wikimedia.org/r/408290 [11:05:37] (03PS3) 10Giuseppe Lavagetto: [WiP] Add support for jsonschema-based entities [software/conftool] - 10https://gerrit.wikimedia.org/r/408585 [11:05:43] godog: what is the rationale behind? 
(asking to have clear ideas and understand your point) [11:06:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 (duration: 01m 12s) [11:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:46] the swappiness =0 use a complete different path in kernel code compared to swappiness 1-100 regarding memory allocation [11:07:03] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408990 (owner: 10Marostegui) [11:07:15] yup, what volans said, from linux 3.5 onwards [11:07:41] I'll update the task [11:09:33] I'm no expert on the details, and it might not solve the issue, but I think is worth a test. Under memory pressure it might be able to survive swapping some pages even with a very low swappiness. At least that's the general idea [11:10:14] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#3780358 (10fgiunchedi) Looks like we have a repro case, namely an rsync without bandwidth limits triggers page allocation stalls / crash. Since `vm.swappiness` is set at 0 across the f... [11:10:53] thanks :) [11:11:21] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#3955178 (10akosiaris) >>! In T181121#3955175, @fgiunchedi wrote: > Looks like we have a repro case, namely an rsync without bandwidth limits triggers page allocation stalls / crash. Si... 
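The suggestion above is to move `vm.swappiness` from 0 to 1, since the participants say the value 0 takes a different kernel path than 1-100. A minimal sketch of checking a host for that case follows. The helper names are hypothetical, `/proc/sys/vm/swappiness` is the standard Linux location of the setting, and the code only reads the value, it does not change it.

```python
from pathlib import Path

# Standard procfs location of the setting on Linux.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def parse_swappiness(raw):
    """Parse the text content of the swappiness file into an int."""
    value = int(raw.strip())
    if value < 0:
        raise ValueError("swappiness cannot be negative")
    return value


def suggested_value(current):
    """Return 1 when swappiness is 0, as proposed in the log.

    Any other value is returned unchanged; this mirrors the chat
    suggestion and is not a general tuning rule.
    """
    return 1 if current == 0 else current


if __name__ == "__main__":
    # Only attempt the read where the file exists (Linux hosts).
    if SWAPPINESS_PATH.exists():
        current = parse_swappiness(SWAPPINESS_PATH.read_text())
        print(f"current={current} suggested={suggested_value(current)}")
```

Applying the change itself would be done with `sysctl` or configuration management, which is left out here because the log does not say how it was rolled out.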
[11:12:20] !log migrate all running VMs off ganeti1005 T181121 [11:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:32] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [11:13:48] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408993 [11:15:38] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408993 (owner: 10Marostegui) [11:18:03] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408993 (owner: 10Marostegui) [11:18:13] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408993 (owner: 10Marostegui) [11:19:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 (duration: 01m 11s) [11:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:18] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408994 (https://phabricator.wikimedia.org/T184599) [11:23:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408994 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [11:24:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408994 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [11:26:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096:3315 - T184599 (duration: 01m 11s) [11:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:42] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [11:26:58] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/408994 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [11:30:00] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408996 [11:32:17] PROBLEM - Host ganeti1005 is DOWN: PING CRITICAL - Packet loss = 100% [11:33:08] down again? [11:33:12] me [11:33:13] ah the reboot [11:33:17] RECOVERY - Host ganeti1005 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [11:33:23] !log reboot ganeti1005 T181121 [11:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:35] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [11:33:53] I 'll move just meitnerium back for a bit and let's try to reproduce with just 1 VM first [11:36:10] I have to go to lunch but will be back in an hour or so [11:37:26] !log Fix replication on labsdb1010 - T186579 [11:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:40] T186579: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579 [11:37:42] godog: replied you wrt the gdash patch :) [11:37:47] good lunch time [11:38:07] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408996 (owner: 10Marostegui) [11:40:11] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408996 (owner: 10Marostegui) [11:40:21] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408996 (owner: 10Marostegui) [11:42:10] (03CR) 10Hashar: [C: 032] support process a lintian output file [debs/jenkins-debian-glue] (patch-queue/debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/408246 (https://phabricator.wikimedia.org/T186494) (owner: 10Hashar) [11:42:11] !log marostegui@tin Synchronized 
wmf-config/db-eqiad.php: Repool db1096:3315 - T184599 (duration: 01m 11s) [11:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:25] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [11:42:29] (03CR) 10Hashar: [C: 032] Hook to run lintian [debs/jenkins-debian-glue] (patch-queue/debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/408260 (owner: 10Hashar) [11:42:49] (03CR) 10Hashar: [C: 032] 0.18.4-wmf1: support process a lintian output file [debs/jenkins-debian-glue] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/408247 (https://phabricator.wikimedia.org/T186494) (owner: 10Hashar) [11:43:03] (03CR) 10Hashar: "recheck" [debs/jenkins-debian-glue] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/408261 (https://phabricator.wikimedia.org/T186494) (owner: 10Hashar) [11:43:12] (03CR) 10Hashar: [C: 032] "recheck" [debs/jenkins-debian-glue] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/408247 (https://phabricator.wikimedia.org/T186494) (owner: 10Hashar) [11:43:37] elukey: wanna redo whatever you did to meitnerium ? It's all alone on ganeti1005 waiting for you [11:45:06] shall we retry with swappiness=1 or first repro with the current value? 
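Note on the swappiness question above: a minimal sketch, assuming a Linux host, of how the current value could be read before deciding whether to change it. The helper name `read_swappiness` is hypothetical and not something used in this channel; the procfs path is the standard location of the `vm.swappiness` sysctl.

```python
from pathlib import Path

# Standard procfs location of the vm.swappiness sysctl on Linux.
SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current vm.swappiness value as an integer.

    `path` is a parameter only so the function can be pointed at a
    test file; on a real host the default is what you want.
    """
    return int(Path(path).read_text().strip())
```

Calling `read_swappiness()` on the host before and after a change (for example after `sysctl -w vm.swappiness=1`, run as root) would confirm which value a reproduction attempt actually ran with.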
[11:47:42] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408998 (https://phabricator.wikimedia.org/T184599) [11:50:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408998 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [11:50:27] PROBLEM - Nginx local proxy to apache on mw2211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:51:17] RECOVERY - Nginx local proxy to apache on mw2211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.201 second response time [11:51:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408998 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [11:51:46] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408998 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [11:52:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 - T184599 (duration: 01m 11s) [11:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:11] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [11:53:33] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409000 [11:53:55] (03CR) 10Hashar: [C: 032] "recheck" [debs/jenkins-debian-glue] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/408247 (https://phabricator.wikimedia.org/T186494) (owner: 10Hashar) [11:54:07] (03PS2) 10Alexandros Kosiaris: Add addshore to contint-docker admins [puppet] - 10https://gerrit.wikimedia.org/r/408823 (https://phabricator.wikimedia.org/T186475) [11:54:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add addshore to contint-docker admins [puppet] - 10https://gerrit.wikimedia.org/r/408823 
(https://phabricator.wikimedia.org/T186475) (owner: 10Alexandros Kosiaris) [11:54:16] (03CR) 10Hashar: [C: 032] 0.18.4-wmf2: add hook B90lintian [debs/jenkins-debian-glue] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/408261 (https://phabricator.wikimedia.org/T186494) (owner: 10Hashar) [11:54:30] (03CR) 10Hashar: [C: 032] 0.18.4-wmf1: support process a lintian output file [debs/jenkins-debian-glue] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/408247 (https://phabricator.wikimedia.org/T186494) (owner: 10Hashar) [11:54:37] (03Merged) 10jenkins-bot: 0.18.4-wmf1: support process a lintian output file [debs/jenkins-debian-glue] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/408247 (https://phabricator.wikimedia.org/T186494) (owner: 10Hashar) [11:54:48] jouncebot: next [11:54:48] In 2 hour(s) and 5 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180208T1400) [11:55:04] (03Merged) 10jenkins-bot: 0.18.4-wmf2: add hook B90lintian [debs/jenkins-debian-glue] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/408261 (https://phabricator.wikimedia.org/T186494) (owner: 10Hashar) [11:55:05] moritzm: current value I 'd say. 
And then swappiness=1 [11:55:45] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409000 (owner: 10Marostegui) [11:56:00] ok, makes sense [11:57:25] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409000 (owner: 10Marostegui) [11:59:19] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409000 (owner: 10Marostegui) [11:59:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 - T184599 (duration: 01m 11s) [12:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:00] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [12:00:14] (03PS4) 10Volans: Backends: add known hosts files backend [software/cumin] - 10https://gerrit.wikimedia.org/r/405719 [12:02:18] (03PS1) 10Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409001 (https://phabricator.wikimedia.org/T184599) [12:04:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409001 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [12:06:23] (03CR) 10Volans: [C: 031] "LGTM, just ensure we're not using the CLI 'find' from elsewhere or convert those to a select."
[software/conftool] - 10https://gerrit.wikimedia.org/r/405301 (owner: 10Giuseppe Lavagetto) [12:06:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409001 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [12:07:08] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409001 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [12:08:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1100 - T184599 (duration: 01m 11s) [12:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:16] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [12:08:22] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Zuul: Upload new zuul and jenkins-debian-glue packages to apt.wikimedia.org - https://phabricator.wikimedia.org/T186786#3955291 (10hashar) [12:08:59] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409002 [12:09:05] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Zuul: Upload new zuul and jenkins-debian-glue packages to apt.wikimedia.org - https://phabricator.wikimedia.org/T186786#3955291 (10hashar) [12:09:12] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Zuul: Upload new zuul and jenkins-debian-glue packages to apt.wikimedia.org - https://phabricator.wikimedia.org/T186786#3955291 (10hashar) [12:09:24] (03CR) 10Volans: [C: 031] "LGTM, one nitpick inline" (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/405302 (https://phabricator.wikimedia.org/T185080) (owner: 10Giuseppe Lavagetto) [12:11:09] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409002 (owner: 10Marostegui) [12:12:36] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool 
db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409002 (owner: 10Marostegui) [12:12:47] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409002 (owner: 10Marostegui) [12:14:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1100 - T184599 (duration: 01m 11s) [12:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:21] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [12:14:25] (03PS1) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409005 (https://phabricator.wikimedia.org/T184599) [12:16:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409005 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [12:18:00] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409005 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [12:18:31] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409007 [12:19:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1106 - T184599 (duration: 01m 11s) [12:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:35] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [12:20:09] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409005 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [12:21:15] (03PS1) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [12:22:21] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Redo mariadb::backup class into role/profile style 
[puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [12:24:28] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 19 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [12:24:28] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 27 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg] [12:27:11] (03PS2) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [12:29:24] (03CR) 10Volans: "Thanks for the fixes, looks better now. See replies inline, basically just one thing I'd change." (035 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/405303 (owner: 10Giuseppe Lavagetto) [12:29:27] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:29:28] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:29:28] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:29:28] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:30:14] lalakompot? is this new? akosiaris [12:30:29] volans: you are fast I see [12:30:36] :D [12:32:09] 10Operations, 10Maps-Sprint, 10Traffic: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3955341 (10BBlack) Yeah it doesn't honor `stale-while-invalidate` directly at this time. It does implement `stale-while-revalidate` -like behavior for all cache objects, but it's... 
[12:32:28] one final test and I think we are ok [12:32:35] ah great [12:34:17] (03PS16) 10Alexandros Kosiaris: ircecho: Support ssl when connecting to irc [puppet] - 10https://gerrit.wikimedia.org/r/405591 (owner: 10Paladox) [12:34:34] (03CR) 10Alexandros Kosiaris: [C: 032] "Just tested it on einsteinium, worked fine. Merging" [puppet] - 10https://gerrit.wikimedia.org/r/405591 (owner: 10Paladox) [12:34:50] \o/ [12:36:51] (03PS3) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [12:38:34] akosiaris: thanks :) [12:38:44] yw [12:39:21] (03PS4) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [12:39:45] (03PS3) 10Paladox: ircecho: Enable ssl by default [puppet] - 10https://gerrit.wikimedia.org/r/405593 [12:39:53] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Some inline comments." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/405593 (owner: 10Paladox) [12:40:01] (03CR) 10jerkins-bot: [V: 04-1] ircecho: Enable ssl by default [puppet] - 10https://gerrit.wikimedia.org/r/405593 (owner: 10Paladox) [12:40:03] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409007 (owner: 10Marostegui) [12:40:12] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [12:40:32] (03PS4) 10Paladox: ircecho: Enable ssl by default [puppet] - 10https://gerrit.wikimedia.org/r/405593 [12:41:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409007 (owner: 10Marostegui) [12:41:54] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409007 (owner: 10Marostegui) [12:42:58] (03PS5) 10Paladox: ircecho: Enable ssl by default [puppet] - 10https://gerrit.wikimedia.org/r/405593 [12:43:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1106 - T184599 (duration: 01m 11s) [12:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:16] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [12:43:21] akosiaris: I think I do ^^ correctly [12:43:32] (Not sure if I got it in all the places) [12:43:50] (03CR) 10Hashar: [C: 04-2] "recheck" [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar) [12:44:09] paladox: see my comment about icinga::ircbot [12:44:55] (03PS1) 10Marostegui: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409011 (https://phabricator.wikimedia.org/T184599) [12:45:18] akosiaris: thanks [12:46:21] (03CR) 10Volans: [C: 031] 
"LGTM already, see a couple of possible improvements inline." (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/408290 (owner: 10Giuseppe Lavagetto) [12:46:28] (03PS6) 10Paladox: ircecho: Enable ssl by default [puppet] - 10https://gerrit.wikimedia.org/r/405593 [12:46:33] akosiaris: done :) [12:46:42] (03PS5) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [12:46:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409011 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [12:47:20] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [12:47:23] (03CR) 10Alexandros Kosiaris: [C: 032] ircecho: Enable ssl by default [puppet] - 10https://gerrit.wikimedia.org/r/405593 (owner: 10Paladox) [12:47:46] Thanks :) [12:47:54] (03CR) 10Jcrespo: [C: 04-1] "Needs moving more stuff to profile and clean up the multiinstance profile." [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [12:48:31] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409011 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [12:48:58] (03CR) 10Paladox: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/405594 (owner: 10Paladox) [12:49:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409011 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [12:49:16] akosiaris: sorry I was afk! Shall I retry now? [12:49:24] elukey: yes please [12:49:32] I imagine full power without --bw-limit right? 
[12:49:33] akosiaris: ^^ for the last one, I have a question for one of the comments. [12:49:40] elukey: yes [12:49:43] ack! [12:49:47] paladox: ? [12:49:48] (03CR) 10Volans: [C: 031] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/408989 (https://phabricator.wikimedia.org/T145192) (owner: 10Ema) [12:50:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1110 - T184599 (duration: 01m 11s) [12:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:19] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [12:51:06] akosiaris: https://gerrit.wikimedia.org/r/405594 [12:51:07] paladox: confirmed that 6697 and SSL is being used for icinga-wm [12:51:38] :) [12:51:46] oh, that's a long one, I have it in my todo list to answer but it will take a while [12:52:05] Ok [12:53:35] started the rsync, still checking what files need to be copied [12:55:11] 10Operations, 10Traffic: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3955364 (10MoritzMuehlenhoff) > First is upgrade tlsproxy hosts to `1.13.6-2+wmf1` (but still on existing `nginx-full` packages) I've upgraded all of mw* to 1.13.6-2+wmf1~jessie1 , this leaves only conf* to be upgraded... [12:57:11] (03CR) 10Marostegui: "So far everything looks good to me!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [12:57:26] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409012 [12:57:43] akosiaris: seems completed, without any issues. 
I can do another test, namely removing the 45G and rsync again [12:57:50] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [software/gdash] - 10https://gerrit.wikimedia.org/r/408787 (https://phabricator.wikimedia.org/T186696) (owner: 10MarcoAurelio) [12:58:15] in any case - sent 33,187,929,825 bytes received 5,319,860 bytes 146,548,563.73 bytes/sec [12:58:19] elukey: hm, ok but give me some time first to fill the server with the normal vms [12:58:26] akosiaris: sure [12:59:44] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409012 (owner: 10Marostegui) [12:59:57] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:01:51] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409012 (owner: 10Marostegui) [13:02:01] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409012 (owner: 10Marostegui) [13:02:53] !log upgrade mwdebug servers to HHVM 3.18.7 [13:03:04] (03PS1) 10Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409013 (https://phabricator.wikimedia.org/T184599) [13:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1110 - T184599 (duration: 01m 11s) [13:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:31] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [13:05:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409013 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [13:07:40] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409013 
(https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [13:09:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409015 [13:09:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 - T184599 (duration: 01m 11s) [13:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:31] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [13:09:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409013 (https://phabricator.wikimedia.org/T184599) (owner: 10Marostegui) [13:09:51] !log upgrade deployment servers and script runners to HHVM 3.18.7 [13:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:12] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409015 (owner: 10Marostegui) [13:14:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409015 (owner: 10Marostegui) [13:14:32] (03PS1) 10Marostegui: db1073: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/409016 [13:15:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 - T184599 (duration: 01m 12s) [13:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:51] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [13:15:56] (03PS1) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409017 (https://phabricator.wikimedia.org/T162807) [13:17:03] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409015 (owner: 10Marostegui) [13:18:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1073 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/409017 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [13:18:30] (03PS1) 10Muehlenhoff: Add apt configuration to switch deployment-prep to the ICU57-enabled HHVM build [puppet] - 10https://gerrit.wikimedia.org/r/409018 [13:19:25] (03CR) 10jerkins-bot: [V: 04-1] Add apt configuration to switch deployment-prep to the ICU57-enabled HHVM build [puppet] - 10https://gerrit.wikimedia.org/r/409018 (owner: 10Muehlenhoff) [13:20:14] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409017 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [13:20:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409017 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [13:21:10] (03CR) 10Hashar: [C: 032] Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar) [13:21:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1073 - T162807 (duration: 01m 12s) [13:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:58] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [13:22:07] !log Fixing data drifts on db1073, also upgrade kernel, socket location and mysql - T162807 [13:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:31] (03Merged) 10jenkins-bot: Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar) [13:24:23] (03PS2) 10Muehlenhoff: Add apt configuration to switch deployment-prep to the ICU57-enabled HHVM build [puppet] - 10https://gerrit.wikimedia.org/r/409018 [13:25:32] (03CR) 10Marostegui: [C: 032] db1073: Update socket location [puppet] - 
10https://gerrit.wikimedia.org/r/409016 (owner: 10Marostegui) [13:28:38] (03CR) 10MarcoAurelio: " and past practices indicates that emp" [software/gdash] - 10https://gerrit.wikimedia.org/r/408787 (https://phabricator.wikimedia.org/T186696) (owner: 10MarcoAurelio) [13:32:29] (03PS1) 10Muehlenhoff: Enable base::firewall for labweb* [puppet] - 10https://gerrit.wikimedia.org/r/409026 [13:33:27] (03PS1) 10Elukey: role::cache::misc: add a ad-hoc varnishkafka instance to test TLS [puppet] - 10https://gerrit.wikimedia.org/r/409027 (https://phabricator.wikimedia.org/T185136) [13:33:33] (03CR) 10Rush: [C: 031] "Thanks arturo for working on this, I think all the previous remaining concerns have been addressed and we should give this go." [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:35:05] (03PS12) 10Arturo Borrero Gonzalez: apt: merge report-pending-upgrades script into apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) [13:36:33] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: merge report-pending-upgrades script into apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:37:50] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/9894/cp1045.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/409027 (https://phabricator.wikimedia.org/T185136) (owner: 10Elukey) [13:41:06] !log Drop dewiki already renamed tables and database on s8 master (db1071) - T184599 [13:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:19] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [13:43:26] (03PS1) 10ArielGlenn: copy lists of file hashes into place before they are used for status reports [dumps] - 10https://gerrit.wikimedia.org/r/409028 
(https://phabricator.wikimedia.org/T185454) [13:43:48] (03CR) 10Jcrespo: [C: 04-1] "The suggestions are mostly things I planned to do, there is more work pending to make these like core hosts." [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [13:43:50] (03CR) 10Elukey: "This is causing daily cronspam from rhenium :)" [puppet] - 10https://gerrit.wikimedia.org/r/408771 (owner: 10Elukey) [13:44:04] (03PS1) 10Marostegui: db-eqiad.php: Repool db1073 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409029 (https://phabricator.wikimedia.org/T162807) [13:45:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1073 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409029 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [13:47:11] (03PS1) 10Ema: cache::canary: test varnish downgrade on pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/409032 [13:47:51] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1073 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409029 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [13:48:02] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1073 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409029 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [13:49:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 with low weight - T162807 (duration: 01m 11s) [13:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:31] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [13:49:39] (03CR) 10Ema: [C: 032] cache::canary: test varnish downgrade on pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/409032 (owner: 10Ema) [13:50:02] FYI: https://tools.wmflabs.org/versions/ lists group0 at wmf.20 but it appears to be at wmf.17 [13:51:57] elukey: we are good 
btw, go ahead with the rsyncs [13:52:01] (03CR) 10ArielGlenn: [C: 032] copy lists of file hashes into place before they are used for status reports [dumps] - 10https://gerrit.wikimedia.org/r/409028 (https://phabricator.wikimedia.org/T185454) (owner: 10ArielGlenn) [13:52:16] akosiaris: full power, limited ? [13:53:44] !log ariel@tin Started deploy [dumps/dumps@9b7841f]: make sure all hashes appear in dumpstatus file , T185454 [13:53:47] !log ariel@tin Finished deploy [dumps/dumps@9b7841f]: make sure all hashes appear in dumpstatus file , T185454 (duration: 00m 02s) [13:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:56] T185454: md5 and sha1 checksums are not available in dumpstatus.json for multistream dumps - https://phabricator.wikimedia.org/T185454 [13:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:22] elukey: full throttle!!!! [13:54:43] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-fgiunchedi: Offline uncorrectable sectors on poolcounter1002 /dev/sda - https://phabricator.wikimedia.org/T186534#3955486 (10fgiunchedi) [13:54:49] !log Rename dewiki tables on s8 slaves - T184599 [13:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:04] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [13:55:32] (03PS1) 10Ema: Revert "cache::canary: test varnish downgrade on pinkunicorn" [puppet] - 10https://gerrit.wikimedia.org/r/409033 [13:56:09] akosiaris: in progress! 
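Note on "full power" versus "limited": a minimal sketch of how an rsync invocation with and without a bandwidth cap could be assembled. The helper `build_rsync_cmd` and the paths in the usage note are hypothetical; the only rsync options assumed are `-a`, `--stats` and `--bwlimit=RATE` (the real option is spelled `--bwlimit`, and the rsync man page describes the rate as kilobytes per second by default).

```python
def build_rsync_cmd(src, dest, bwlimit_kbps=None):
    """Build an rsync argument list, optionally capped with --bwlimit.

    bwlimit_kbps=None means no cap ("full throttle"); any number is
    passed through to rsync's --bwlimit option unchanged.
    """
    cmd = ["rsync", "-a", "--stats"]
    if bwlimit_kbps is not None:
        cmd.append("--bwlimit={}".format(bwlimit_kbps))
    cmd.extend([src, dest])
    return cmd
```

For example, `build_rsync_cmd("/srv/data/", "target-host:/srv/data/")` gives the uncapped form, and passing `bwlimit_kbps=50000` adds the cap. The list can be handed to `subprocess.run` if it should actually be executed.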
[13:56:45] (CR) Ema: [V: 2 C: 2] Revert "cache::canary: test varnish downgrade on pinkunicorn" [puppet] - https://gerrit.wikimedia.org/r/409033 (owner: Ema) [13:57:10] * akosiaris tails -f /var/log/kern.log [13:58:24] (PS13) Zoranzoki21: Change namespaces on urwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) [13:59:55] (CR) jerkins-bot: [V: -1] Change namespaces on urwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: Zoranzoki21) [14:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180208T1400). [14:00:04] No GERRIT patches in the queue for this window AFAICS. [14:00:22] akosiaris: so swappiness still 0, but nothing is exploding? [14:00:30] nope [14:00:30] in that case, I can SWAT :P [14:00:41] there's something more to the equation it seems [14:00:51] not just doing IO [14:00:51] I added https://gerrit.wikimedia.org/r/#/c/407901 at 14:59 [14:01:01] elukey: mind dropping the fs caches? [14:01:14] Zoranzoki21: looks like you have added it after 15:00 :D [14:01:14] and redoing the rsync? [14:01:23] don't forget to make sure you synced first [14:01:31] elukey, akosiaris: problems? can we swat? [14:01:34] zeljkof: I added it at 14:59:55 [14:01:43] zeljkof: no problems, proceed as normal [14:02:39] all systems nominal, continue countdown [14:02:51] t minus 1 minute to liftoff [14:02:56] akosiaris: never forced it, do you mean "echo 3 > /proc/sys/vm/drop_caches"?
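Note on the drop_caches values being discussed: a minimal sketch, assuming a Linux host, of what each value written to `/proc/sys/vm/drop_caches` asks the kernel to do, as described in the kernel's sysctl/vm documentation. The helper `drop_caches` is hypothetical. It defaults to a dry run, and a real write needs root.

```python
import os
from pathlib import Path

# Meanings of the values accepted by /proc/sys/vm/drop_caches,
# per the Linux kernel sysctl/vm documentation.
DROP_CACHES_MODES = {
    1: "free the page cache",
    2: "free reclaimable slab objects (dentries and inodes)",
    3: "free the page cache and reclaimable slab objects",
}


def drop_caches(mode, path="/proc/sys/vm/drop_caches", dry_run=True):
    """Describe (and optionally perform) a drop_caches request.

    With dry_run=True nothing is written. With dry_run=False the
    function calls sync first, since only clean cache pages can be
    dropped, then writes the mode to `path`.
    """
    if mode not in DROP_CACHES_MODES:
        raise ValueError("mode must be 1, 2 or 3, got {!r}".format(mode))
    if not dry_run:
        os.sync()  # flush dirty pages so more of the cache is droppable
        Path(path).write_text("{}\n".format(mode))
    return DROP_CACHES_MODES[mode]
```

The sync-before-write order mirrors the "make sure you synced first" reminder in the channel. Dropping caches is non-destructive, but it makes subsequent reads hit the disk again, which is the point when trying to reproduce an I/O problem.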
[14:03:11] or maybe only 1 [14:04:59] (CR) Zoranzoki21: Change namespaces on urwiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: Zoranzoki21) [14:07:14] elukey: 1 [14:08:11] for the swat record, reviewing 407901 [14:09:55] stephanebisson: I see https://test.wikipedia.org/wiki/Special:Version at 1.31.0-wmf.20 so https://tools.wmflabs.org/versions/ does look correct [14:13:25] (PS2) Rush: Enable base::firewall for labweb* [puppet] - https://gerrit.wikimedia.org/r/409026 (owner: Muehlenhoff) [14:13:42] akosiaris: in progress [14:14:07] (CR) Herron: [C: -2] "https://puppet-compiler.wmflabs.org/compiler02/9895/" [puppet] - https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) (owner: Herron) [14:14:16] let's see [14:14:42] (CR) Rush: [C: 2] Enable base::firewall for labweb* [puppet] - https://gerrit.wikimedia.org/r/409026 (owner: Muehlenhoff) [14:16:20] (PS14) Zoranzoki21: Change namespaces on urwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) [14:17:01] (PS15) Zoranzoki21: Change namespaces on urwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) [14:17:35] (CR) Zfilipin: Change namespaces on urwiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: Zoranzoki21) [14:17:57] PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:58] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:11] (PS1) Ema: cache_upload: upgrade cp2002 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/409036 (https://phabricator.wikimedia.org/T180433) [14:18:13] (PS1) Ema: cache_upload: upgrade cp2005 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/409037
(https://phabricator.wikimedia.org/T180433) [14:18:15] (03PS1) 10Ema: cache_upload: upgrade cp2008 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409038 (https://phabricator.wikimedia.org/T180433) [14:18:17] (03PS1) 10Ema: cache_upload: upgrade cp2011 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409039 (https://phabricator.wikimedia.org/T180433) [14:18:19] (03PS1) 10Ema: cache_upload: upgrade cp2014 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409040 (https://phabricator.wikimedia.org/T180433) [14:18:21] (03PS1) 10Ema: cache_upload: upgrade cp2017 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409041 (https://phabricator.wikimedia.org/T180433) [14:18:23] (03PS1) 10Ema: cache_upload: upgrade cp2020 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409042 (https://phabricator.wikimedia.org/T180433) [14:18:25] (03PS1) 10Ema: cache_upload: upgrade cp2022 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409043 (https://phabricator.wikimedia.org/T180433) [14:18:30] (03PS1) 10Ema: cache_upload: upgrade cp2024 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409044 (https://phabricator.wikimedia.org/T180433) [14:18:32] (03PS1) 10Ema: cache_upload: upgrade codfw to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409045 (https://phabricator.wikimedia.org/T180433) [14:18:48] elukey: you did it [14:18:57] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100% [14:18:59] PROBLEM - Host kubestagetcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:59] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:59] PROBLEM - Host ununpentium is DOWN: PING CRITICAL - Packet loss = 100% [14:18:59] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:59] PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:35] volans: did I win anything? 
:D [14:19:37] PROBLEM - Host meitnerium is DOWN: PING CRITICAL - Packet loss = 100% [14:19:55] elukey: some more reimages to do! [14:19:57] PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:59] :-P [14:20:13] volans: yessssssssss \o/ [14:20:23] omg [14:23:03] !log upgrade cp2002 to varnish 5 [14:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:37] PROBLEM - etcd request latencies on chlorine is CRITICAL: CRITICAL - etcd_request_latencies is 83593 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:23:39] (03CR) 10Ema: [C: 032] cache_upload: upgrade cp2002 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409036 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [14:24:27] PROBLEM - etcd request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 158093 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:24:37] RECOVERY - etcd request latencies on chlorine is OK: OK - etcd_request_latencies is 2799 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:24:58] (03PS16) 10Zoranzoki21: Change namespaces on urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) [14:25:05] fun part is that ganeti1005 did not log anything yet [14:25:18] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:29] RECOVERY - etcd request latencies on argon is OK: OK - etcd_request_latencies is 3610 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:25:45] and the VMs? 
[14:25:45] on the other hand with the memory being in stress (or whatever) that's kind of expected I guess [14:26:11] ctrl+c to my tail -f /var/log/kern.log hasn't yet done anything [14:26:33] I am guessing the box is still unresponsive [14:27:48] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:31] Zoranzoki21: do I need to run a script after deploying your patch? [14:30:26] indeed I can't find anything on lithium for remote syslog either [14:31:01] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [14:31:15] and it's not recovering. I am guessing we will have to force reboot it [14:31:28] this time around elukey really did it [14:31:32] zeljkof: Let me know when you're done with SWAT. [14:31:33] !log upgrading deployment-mediawiki04 to HHVM linked against ICU 57 [14:31:41] well done elukey ! [14:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:07] akosiaris: shall I powercycle? I'm on the mgmt since I wanted to check whether there's additional output to console (but there was none) [14:32:31] ah nice, I was starting it to log in myself. Yeah I think so [14:32:46] anomie: there is just one patch, I would be done in a few minutes if somebody could take a look and give it a +1 or -1 ;) [14:32:53] although, wait [14:32:54] https://gerrit.wikimedia.org/r/#/c/407901/ [14:33:07] console now logs oom killer messages [14:33:08] o/ [14:33:18] for various qemu instance, I bet we can login shortly [14:33:42] ah nice... 
so it's recovering [14:33:54] ~20mins it looks like [14:34:04] probably even more [14:34:30] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [14:35:01] RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [14:35:06] login works now [14:35:08] SSH [14:35:40] RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 2.05 ms [14:35:40] RECOVERY - Host ununpentium is UP: PING OK - Packet loss = 0%, RTA = 2.20 ms [14:35:50] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 2.40 ms [14:35:50] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 2.99 ms [14:36:00] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [14:36:00] RECOVERY - Host kubestagetcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [14:36:20] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [14:36:22] RECOVERY - Host etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [14:36:22] RECOVERY - Host meitnerium is UP: PING OK - Packet loss = 0%, RTA = 1.26 ms [14:36:38] Feb 8 14:17:47 ganeti1005 kernel: [ 9904.098151] qemu-system-x86: page allocation stalls for 17260ms, order:0, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK) [14:36:43] that's the very first line [14:36:57] this is with swappiness 0 or 1? [14:37:00] 0 [14:37:19] ok [14:37:32] at least it does look like we can reproduce it.. [14:37:46] (03PS1) 10Ema: wmf-upgrade-varnish: initial release [puppet] - 10https://gerrit.wikimedia.org/r/409047 (https://phabricator.wikimedia.org/T168529) [14:38:01] elukey: do you mind adding the commands used in a phab paste ? so I can trying getting a minimal reproduction case without bugging you all the time ? 
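
The "page allocation stalls" kernel message quoted above can be pulled out of a log mechanically, which may help when comparing runs with different swappiness values or kernels. The following is a minimal sketch, not anything used in the channel: the function name `parse_stall` and the regular expression are assumptions for illustration, modelled only on the single line pasted at 14:36:38.

```python
import re

# Hypothetical pattern, modelled on the one kernel line quoted in the log:
#   ... kernel: [ 9904.098151] qemu-system-x86: page allocation stalls for 17260ms, order:0, ...
STALL_RE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] (?P<proc>\S+): "
    r"page allocation stalls for (?P<ms>\d+)ms, order:(?P<order>\d+)"
)


def parse_stall(line):
    """Return the stall details from one kern.log line, or None if it does not match."""
    match = STALL_RE.search(line)
    if match is None:
        return None
    return {
        "uptime_s": float(match.group("uptime")),  # seconds since boot
        "process": match.group("proc"),            # task that hit the stall
        "stall_ms": int(match.group("ms")),        # reported stall duration
        "order": int(match.group("order")),        # allocation order
    }


# The line pasted in the channel, used here as sample input.
SAMPLE = (
    "Feb 8 14:17:47 ganeti1005 kernel: [ 9904.098151] qemu-system-x86: "
    "page allocation stalls for 17260ms, order:0, "
    "mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)"
)

if __name__ == "__main__":
    print(parse_stall(SAMPLE))
```

Other kernel versions may word this message differently, so the pattern would likely need adjusting before being run against a real /var/log/kern.log.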
[14:38:23] (03CR) 10jerkins-bot: [V: 04-1] wmf-upgrade-varnish: initial release [puppet] - 10https://gerrit.wikimedia.org/r/409047 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [14:38:51] I'd say let's install 4.14.13-1~bpo9+1 from stretch-backports and retry with that kernel? that should tell us whether we're dealing with a kernel bug which is already fixed upstream or something deeper lying [14:38:58] Zoranzoki21: please stand by, hashar and I are reviewing the patch [14:39:19] zeljkof: Hmm. The linked task says Index should be "اشاریہ", but the patch has "شاری" instead. It looks like they also want additional aliases for Thesaurus and Reconstruction. And judging by other entries, the aliases should have underscores instead of spaces. [14:39:19] ok [14:39:23] akosiaris: sure, I can add them to the task [14:39:24] moritzm: yeah that next in my list. That and trying with the older kernel [14:39:40] PROBLEM - etcd request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 3400580 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:39:51] anomie: uh oh. could you please leave a comment in gerrit? [14:40:07] we don't have to deploy today if it's not ready [14:40:22] (03CR) 10Anomie: "The linked task says Index should be "اشاریہ", but the patch has "شاری" instead. It looks like they also want additional aliases for Thesa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: 10Zoranzoki21) [14:41:08] (03PS1) 10Anomie: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on meta, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409048 (https://phabricator.wikimedia.org/T166733) [14:41:10] Zoranzoki21: let's not deploy 407901 today, looks like there are some problems with it [14:41:28] feel free to add the patch to another swat when the problems are fixed [14:42:34] Zoranzoki21: I am not even sure why some of those namespaces need to be renamed. 
Seems some are provided by mw extensions [14:42:39] but yeah hmm I don't know really :] [14:42:40] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:42:57] better to get it reviewed by someone familiar with the craziness namespaces are [14:43:26] thanks anomie and hashar [14:43:34] !log EU SWAT finished [14:43:40] RECOVERY - etcd request latencies on argon is OK: OK - etcd_request_latencies is 3816 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:43:45] anomie: ^ [14:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:56] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#3955708 (10elukey) Steps to repro on meitnerium: ``` fsck /dev/vdb1 mount /dev/vdb1 /mnt/archiva rm -rf /mnt/archiva/* rsync -av /var/lib/archiva /mnt/archiva --progress ``` [14:44:02] akosiaris: --^ [14:44:12] (03CR) 10Anomie: [C: 032] "SWAT-ish" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409048 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [14:44:41] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [14:46:15] elukey: thanks! 
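
The repro steps posted to T181121, together with the earlier suggestion to sync and drop the fs caches before redoing the rsync, can be written down as an ordered command plan. This is only a sketch under assumptions: `repro_commands` is a hypothetical helper, placing the cache drop immediately before the rsync is a guess at the intended order, and the function just returns the command strings without running them, since the steps are destructive.

```python
def repro_commands(device="/dev/vdb1", mountpoint="/mnt/archiva",
                   source="/var/lib/archiva", drop_caches=True):
    """Build the list of shell commands for the repro; nothing is executed."""
    cmds = [
        f"fsck {device}",
        f"mount {device} {mountpoint}",
        f"rm -rf {mountpoint}/*",
    ]
    if drop_caches:
        # Assumption: flush dirty pages, then drop the page cache (value 1,
        # as suggested in the channel) right before the copy starts.
        cmds.append("sync")
        cmds.append("echo 1 > /proc/sys/vm/drop_caches")
    cmds.append(f"rsync -av {source} {mountpoint} --progress")
    return cmds


if __name__ == "__main__":
    # Print the plan for review instead of running it.
    for command in repro_commands():
        print(command)
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere; the actual commands would need root on the test VM.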
[14:46:20] (03Merged) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on meta, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409048 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [14:46:22] (03PS17) 10Zoranzoki21: Change namespaces on urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) [14:47:08] (03CR) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on meta, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409048 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [14:47:10] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:57] !log anomie@tin Synchronized wmf-config/InitialiseSettings.php: Setting wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on meta and mediawiki.org (duration: 01m 12s) [14:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:59] (03CR) 10Ottomata: [C: 031] role::logging::kafkatee::webrequest::base: move out code related to outputs [puppet] - 10https://gerrit.wikimedia.org/r/408771 (owner: 10Elukey) [14:49:08] !log migrate all running VMs off ganeti1005 T181121 [14:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:20] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [14:49:43] (03PS2) 10Ema: wmf-upgrade-varnish: initial release [puppet] - 10https://gerrit.wikimedia.org/r/409047 (https://phabricator.wikimedia.org/T168529) [14:49:51] (03CR) 10Ottomata: [C: 031] role::cache::misc: add a ad-hoc varnishkafka instance to test TLS [puppet] - 10https://gerrit.wikimedia.org/r/409027 (https://phabricator.wikimedia.org/T185136) (owner: 10Elukey) [14:50:50] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 39 probes of 305 (alerts on 19) - 
https://atlas.ripe.net/measurements/1791210/#!map [14:55:50] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 305 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [14:56:23] !log upgrade cp2005 to varnish 5 [14:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:57] (03CR) 10Ema: [C: 032] cache_upload: upgrade cp2005 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409037 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [14:59:46] 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, and 2 others: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3955768 (10elukey) Finally we have something working! Example from stat1004 ``` elukey@stat1004:~$ hive [.. som output ..] h... [14:59:51] !log installing icu security updates from jessie/stretch point releases [15:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:21] (03PS2) 10Elukey: role::logging::kafkatee::webrequest::base: move out code related to outputs [puppet] - 10https://gerrit.wikimedia.org/r/408771 [15:01:48] (03CR) 10Elukey: [C: 032] role::logging::kafkatee::webrequest::base: move out code related to outputs [puppet] - 10https://gerrit.wikimedia.org/r/408771 (owner: 10Elukey) [15:06:58] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409051 [15:08:12] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409051 (owner: 10Marostegui) [15:10:58] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "So I verified, and while you need the librenms user to access those files, you don't need www-data (the web user) to do it. 
So what you're" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399101 (owner: 10Ayounsi) [15:11:38] akosiaris: I checked https://www.mediawiki.org/wiki/Special:Version and it says wmf.17 so it looks like group0 is not all running the same version. That's unexpected. [15:12:39] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409051 (owner: 10Marostegui) [15:12:50] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409051 (owner: 10Marostegui) [15:12:57] (03Draft1) 10Paladox: Gerrit: Set change.disablePrivateChanges to true [puppet] - 10https://gerrit.wikimedia.org/r/409052 [15:13:01] (03PS2) 10Paladox: Gerrit: Set change.disablePrivateChanges to true [puppet] - 10https://gerrit.wikimedia.org/r/409052 [15:13:25] (03CR) 10Paladox: "https://gerrit-review.googlesource.com/#/c/gerrit/+/157710/" [puppet] - 10https://gerrit.wikimedia.org/r/409052 (owner: 10Paladox) [15:14:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1073 - T162807 (duration: 01m 12s) [15:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:35] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [15:15:24] (03PS4) 10Muehlenhoff: Add support for selective automatic restarts of stateless services after library upgrades (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) [15:16:07] akosiaris: just to avoid stepping on your feet while testing - Can I proceed with completing the archiva work on meitnerium or do you still need it for testing? 
[15:16:10] (03CR) 10jerkins-bot: [V: 04-1] Add support for selective automatic restarts of stateless services after library upgrades (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:16:43] elukey: niah I 'll try to reproduce with a different VM. Proceed with whatever it is you were doing. Thanks! [15:17:07] (03PS2) 10Ema: cache_upload: upgrade cp2008 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409038 (https://phabricator.wikimedia.org/T180433) [15:17:32] (03PS1) 10BBlack: dns-rec-lb IPs for ulsfo and eqsin [dns] - 10https://gerrit.wikimedia.org/r/409053 [15:18:20] !log upgrade cp2008 to varnish 5 [15:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:15] (03CR) 10Ema: [C: 032] cache_upload: upgrade cp2008 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409038 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [15:19:17] akosiaris: ack thanks! 
[15:20:55] (03CR) 10BBlack: [C: 032] dns-rec-lb IPs for ulsfo and eqsin [dns] - 10https://gerrit.wikimedia.org/r/409053 (owner: 10BBlack) [15:31:13] (03PS1) 10Filippo Giunchedi: WIP: check prometheus metric [puppet] - 10https://gerrit.wikimedia.org/r/409054 (https://phabricator.wikimedia.org/T181410) [15:31:35] (03CR) 10jerkins-bot: [V: 04-1] WIP: check prometheus metric [puppet] - 10https://gerrit.wikimedia.org/r/409054 (https://phabricator.wikimedia.org/T181410) (owner: 10Filippo Giunchedi) [15:33:32] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [15:33:52] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [15:35:05] !log upgrade cp2011 to varnish 5 [15:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:00] (03PS1) 10BBlack: ntp.conf: trivial restrict changes [puppet] - 10https://gerrit.wikimedia.org/r/409055 [15:36:02] (03PS1) 10BBlack: ntp peering refactor [puppet] - 10https://gerrit.wikimedia.org/r/409056 [15:36:04] (03PS1) 10BBlack: ntp: use "pool" and "restrict source" on stretch [puppet] - 10https://gerrit.wikimedia.org/r/409057 [15:36:06] (03PS1) 10BBlack: ulsfo LVS config for dns-rec-lb [puppet] - 10https://gerrit.wikimedia.org/r/409058 [15:36:08] (03PS1) 10BBlack: dns400x: configure role::recursor for ntp+dns [puppet] - 10https://gerrit.wikimedia.org/r/409059 [15:36:10] (03PS1) 10BBlack: ntp: add dns400x to peer lists [puppet] - 10https://gerrit.wikimedia.org/r/409060 [15:36:12] (03PS1) 10BBlack: ulsfo: use local DNS [puppet] - 10https://gerrit.wikimedia.org/r/409061 [15:36:14] (03PS2) 10Ema: cache_upload: upgrade cp2011 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409039 (https://phabricator.wikimedia.org/T180433) [15:36:24] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp2011 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409039 (https://phabricator.wikimedia.org/T180433) (owner: 
10Ema) [15:36:27] (03PS5) 10Muehlenhoff: Add support for selective automatic restarts of stateless services after library upgrades (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) [15:38:17] (03CR) 10jerkins-bot: [V: 04-1] Add support for selective automatic restarts of stateless services after library upgrades (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:46:22] (03CR) 10BBlack: [C: 031] "as expected through here on existing servers, will control roll-out: https://puppet-compiler.wmflabs.org/compiler02/9896/" [puppet] - 10https://gerrit.wikimedia.org/r/409057 (owner: 10BBlack) [15:46:52] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:03] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:43] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [15:47:51] !log disabling puppet on all global dns recursors for controlled config deploy [15:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:16] !log installing libio-socket-ssl-perl update from jessie point release [15:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:40] (03PS2) 10BBlack: ntp.conf: trivial restrict changes [puppet] - 10https://gerrit.wikimedia.org/r/409055 [15:48:42] (03PS2) 10BBlack: ntp peering refactor [puppet] - 10https://gerrit.wikimedia.org/r/409056 [15:48:47] (03PS2) 10BBlack: ntp: use "pool" and "restrict source" on stretch [puppet] - 10https://gerrit.wikimedia.org/r/409057 [15:48:49] (03PS2) 10BBlack: ulsfo LVS config for dns-rec-lb [puppet] - 10https://gerrit.wikimedia.org/r/409058 [15:48:52] (03PS2) 10BBlack: dns400x: configure role::recursor for ntp+dns [puppet] - 10https://gerrit.wikimedia.org/r/409059 [15:48:55] (03PS2) 10BBlack: 
ntp: add dns400x to peer lists [puppet] - 10https://gerrit.wikimedia.org/r/409060 [15:48:57] (03PS2) 10BBlack: ulsfo: use local DNS [puppet] - 10https://gerrit.wikimedia.org/r/409061 [15:48:59] (03CR) 10BBlack: [C: 032] ntp.conf: trivial restrict changes [puppet] - 10https://gerrit.wikimedia.org/r/409055 (owner: 10BBlack) [15:49:05] (03CR) 10BBlack: [C: 032] ntp peering refactor [puppet] - 10https://gerrit.wikimedia.org/r/409056 (owner: 10BBlack) [15:49:18] (03CR) 10BBlack: [C: 032] ntp: use "pool" and "restrict source" on stretch [puppet] - 10https://gerrit.wikimedia.org/r/409057 (owner: 10BBlack) [15:50:28] 10Operations, 10ops-eqiad, 10hardware-requests: decom fluorine - https://phabricator.wikimedia.org/T159996#3955878 (10RobH) [15:50:52] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:52:52] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [15:53:32] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:54:10] 10Operations, 10ops-eqiad: Non-redundant power supply on helium - https://phabricator.wikimedia.org/T186808#3955897 (10MoritzMuehlenhoff) [15:54:32] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset -0.000773 secs [15:56:02] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:02] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:57:08] !log upgrade cp2014 to varnish 5 [15:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:53] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:58:03] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset 0.008291 secs [15:58:22] (03PS2) 10Ema: cache_upload: upgrade cp2014 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409040 (https://phabricator.wikimedia.org/T180433) 
[15:58:32] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp2014 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409040 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [15:58:53] RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset 0.004519 secs [15:59:53] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [16:00:53] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset 0.000883 secs [16:01:42] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [16:02:42] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset 0.009152 secs [16:02:53] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [16:03:41] !log andrew@tin Started deploy [horizon/deploy@2f176e2]: updating with designate dashboard [16:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:52] !log andrew@tin Finished deploy [horizon/deploy@2f176e2]: updating with designate dashboard (duration: 01m 11s) [16:04:53] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 0.003506 secs [16:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:11] !log ntp servers back to normal [16:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:18] (03CR) 10Zoranzoki21: "Ahh 409k :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409000 (owner: 10Marostegui) [16:08:43] ACKNOWLEDGEMENT - IPMI Sensor Status on helium is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] Muehlenhoff T186808 [16:10:05] !log andrew@tin Started deploy [horizon/deploy@9af532a]: updating with designate dashboard -- take two [16:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:28] !log andrew@tin Finished deploy [horizon/deploy@9af532a]: updating with designate dashboard -- take 
two (duration: 01m 24s) [16:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:25] !log upgrade cp2017 to varnish 5 [16:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:52] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:13:55] (03PS2) 10Ema: cache_upload: upgrade cp2017 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409041 (https://phabricator.wikimedia.org/T180433) [16:13:57] !log rebooting dns400[12] (downtimed, currently spare::system) [16:14:07] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp2017 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409041 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [16:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:32] RECOVERY - Host mc2036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.63 ms [16:17:30] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: sca1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=sca', 'service=zotero']) [16:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:27] !log depool sca1004 (zotero) for T181121 [16:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:38] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [16:18:42] (03PS1) 10Ayounsi: Smokeping: commenting out asw1-eqsin until SFP-T replaced [puppet] - 10https://gerrit.wikimedia.org/r/409071 [16:19:02] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 4.16 ms [16:20:13] (03CR) 10Ayounsi: [C: 032] Smokeping: commenting out asw1-eqsin until SFP-T replaced [puppet] - 10https://gerrit.wikimedia.org/r/409071 (owner: 10Ayounsi) [16:20:42] !log reboot ganeti1005 T181121 [16:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:52] (03CR) 10Alexandros 
Kosiaris: [C: 04-1] "FYI service-checker 0.14 has been uploaded for both jessie and stretch" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220) (owner: 10Dduvall) [16:23:42] !log stop archiva on meitnerium to swap /var/lib/archiva from the root partition to a new separate one - T186020 [16:23:42] PROBLEM - Host ganeti1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:48] that's me ^ [16:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:54] T186020: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020 [16:23:59] (03PS3) 10BBlack: ulsfo LVS config for dns-rec-lb [puppet] - 10https://gerrit.wikimedia.org/r/409058 [16:24:01] (03PS3) 10BBlack: dns400x: configure role::recursor for ntp+dns [puppet] - 10https://gerrit.wikimedia.org/r/409059 [16:24:02] RECOVERY - Host ganeti1005 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:24:03] (03PS3) 10BBlack: ntp: add dns400x to peer lists [puppet] - 10https://gerrit.wikimedia.org/r/409060 [16:24:06] (03PS3) 10BBlack: ulsfo: use local DNS [puppet] - 10https://gerrit.wikimedia.org/r/409061 [16:25:51] !log puppet disabled on lvs400[67] for initial ulsfo recdns config process [16:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:29] (03CR) 10BBlack: [C: 032] ulsfo LVS config for dns-rec-lb [puppet] - 10https://gerrit.wikimedia.org/r/409058 (owner: 10BBlack) [16:26:33] (03CR) 10BBlack: [C: 032] dns400x: configure role::recursor for ntp+dns [puppet] - 10https://gerrit.wikimedia.org/r/409059 (owner: 10BBlack) [16:27:48] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3955995 (10akosiaris) [16:28:10] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3258942 
(10akosiaris) 05Open>03Resolved This is finally done, resolving [16:29:16] (03CR) 10BBlack: [C: 032] ntp: add dns400x to peer lists [puppet] - 10https://gerrit.wikimedia.org/r/409060 (owner: 10BBlack) [16:30:01] !log puppet disabled on all ntp servers for initial ulsfo recdns/ntp config process [16:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:20] !log upgrade cp2020 to varnish 5 [16:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:41] (03PS2) 10Ema: cache_upload: upgrade cp2020 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409042 (https://phabricator.wikimedia.org/T180433) [16:31:55] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp2020 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409042 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [16:32:38] !log installing mysql security updates on auth* [16:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:59] (03PS2) 10Alexandros Kosiaris: admin: Add builder-docker group extending ops rights [puppet] - 10https://gerrit.wikimedia.org/r/407642 [16:34:32] PROBLEM - SSH on labtestnet2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:54] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [16:36:22] RECOVERY - SSH on labtestnet2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [16:37:32] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [16:37:42] PROBLEM - Host mc2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:39:43] !log installing PHP7 security updates [16:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:42] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:45:52] 10Operations, 10ops-eqiad: Non-redundant power supply on helium - 
https://phabricator.wikimedia.org/T186808#3956024 (10RobH) p:05Triage>03High a:03Cmjohnson [16:46:11] 10Operations, 10ops-eqiad: Non-redundant power supply on helium - https://phabricator.wikimedia.org/T186808#3955897 (10RobH) p:05High>03Normal [16:46:40] 10Operations, 10Discovery, 10Discovery-Search, 10Wikidata, and 2 others: Setup a WDQS test cluster on real hardware - https://phabricator.wikimedia.org/T186713#3956027 (10RobH) p:05Triage>03Normal [16:47:02] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:43] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [16:48:12] RECOVERY - Host mc2036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.60 ms [16:50:43] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:45] (03PS3) 10Alexandros Kosiaris: admin: Add builder-docker group extending ops rights [puppet] - 10https://gerrit.wikimedia.org/r/407642 [16:51:02] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [16:51:17] (03CR) 10Alexandros Kosiaris: [C: 032] admin: Add builder-docker group extending ops rights [puppet] - 10https://gerrit.wikimedia.org/r/407642 (owner: 10Alexandros Kosiaris) [16:51:55] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#3956034 (10Papaul) a:05Papaul>03RobH Main board replacement complete. New eth0 MAC address : e0:07:1b:f8:87:c8 [16:52:29] 10Operations, 10ops-eqiad: Non-redundant power supply on helium - https://phabricator.wikimedia.org/T186808#3956037 (10Cmjohnson) This server is 5+years old and needs to be replaced. . 
[16:54:03] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [16:54:12] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:03] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset -0.001183 secs [16:58:20] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [16:58:48] !log upgrade cp2022 to varnish 5 [16:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:05] 10Operations, 10ops-eqiad: Non-redundant power supply on helium - https://phabricator.wikimedia.org/T186808#3955897 (10RobH) I just wanted to make sure this was a failed power supply, and not simply a power cable coming unseated? There isn't really a way to tell the difference remotely. [16:59:49] (03PS2) 10Ema: cache_upload: upgrade cp2022 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409043 (https://phabricator.wikimedia.org/T180433) [16:59:49] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [16:59:58] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp2022 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409043 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [17:00:04] godog, moritzm, and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180208T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. 
[17:00:12] 10Operations, 10ops-eqiad: Missing servers in racktables - https://phabricator.wikimedia.org/T186814#3956080 (10Cmjohnson) [17:00:19] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset 0.002834 secs [17:01:49] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset 0.001975 secs [17:02:10] no puppet swat -> https://i.imgur.com/WaEoarS.mp4 [17:02:49] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [17:03:11] 10Operations, 10ops-eqiad: Non-redundant power supply on helium - https://phabricator.wikimedia.org/T186808#3956101 (10akosiaris) >>! In T186808#3956037, @Cmjohnson wrote: > This server is 5+years old and needs to be replaced. . Yes, it's already been scheduled to happen, but timing is still uncertain. [17:03:15] (03CR) 10BBlack: [C: 031] role::cache::misc: add a ad-hoc varnishkafka instance to test TLS [puppet] - 10https://gerrit.wikimedia.org/r/409027 (https://phabricator.wikimedia.org/T185136) (owner: 10Elukey) [17:03:38] godog's gifs are back <3 [17:05:09] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [17:06:50] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [17:07:06] (03PS1) 10Jgreen: reassign frlog1001 management hostname/ip into mgmt.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/409080 [17:07:47] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3956106 (10elukey) ``` elukey@meitnerium:~$ df -h Filesystem Size Used Avail Use% Mounted on udev 10M 0 10M 0% /dev tmpfs 792M 8.4M 783... 
[17:08:09] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:08:21] (03CR) 10Jgreen: [C: 032] reassign frlog1001 management hostname/ip into mgmt.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/409080 (owner: 10Jgreen) [17:08:50] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset 0.000692 secs [17:11:56] (03PS2) 10Elukey: role::cache::misc: add a ad-hoc varnishkafka instance to test TLS [puppet] - 10https://gerrit.wikimedia.org/r/409027 (https://phabricator.wikimedia.org/T185136) [17:12:17] (03PS3) 10Herron: puppetdb: add major version parameter and add data types [puppet] - 10https://gerrit.wikimedia.org/r/405808 (https://phabricator.wikimedia.org/T185501) [17:12:54] (03CR) 10jerkins-bot: [V: 04-1] puppetdb: add major version parameter and add data types [puppet] - 10https://gerrit.wikimedia.org/r/405808 (https://phabricator.wikimedia.org/T185501) (owner: 10Herron) [17:13:57] (03CR) 10Elukey: [C: 032] role::cache::misc: add a ad-hoc varnishkafka instance to test TLS [puppet] - 10https://gerrit.wikimedia.org/r/409027 (https://phabricator.wikimedia.org/T185136) (owner: 10Elukey) [17:14:00] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:00] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [17:16:27] !log upgrade cp2024 to varnish 5 [17:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:59] (03PS2) 10Ema: cache_upload: upgrade cp2024 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409044 (https://phabricator.wikimedia.org/T180433) [17:17:08] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp2024 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409044 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [17:17:37] (03PS4) 10Herron: puppetdb: add major version parameter and add data types [puppet] - 10https://gerrit.wikimedia.org/r/405808 
(https://phabricator.wikimedia.org/T185501) [17:18:09] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:18:17] these are mine --^ [17:18:24] new varnishkafka instance that doesn't like me [17:18:26] checking now [17:18:33] :) [17:18:36] (03CR) 10jerkins-bot: [V: 04-1] puppetdb: add major version parameter and add data types [puppet] - 10https://gerrit.wikimedia.org/r/405808 (https://phabricator.wikimedia.org/T185501) (owner: 10Herron) [17:21:09] it might be the fact that the kafka cluster name is "jumbo_eqiad" and not only "jumbo", but how is it working then on pink unicorn? [17:21:25] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: sca1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=sca', 'service=zotero']) [17:21:34] !log repool sca1004 (zotero) for T181121 [17:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:50] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [17:22:02] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:22:31] elukey: what's up? [17:22:51] ottomata: I am getting Error while evaluating a Function Call, undefined method `[]' for nil:NilClass at /etc/puppet/modules/profile/manifests/cache/kafka/webrequest/jumbo.pp:29:15 on node cp2006.codfw.wmnet [17:23:10] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:12] that kinda makes sense since the default kafka cluster is 'jumbo' and not 'jumbo-eqiad' [17:23:26] buuut why is it working on pink unicorn? [17:24:13] <_joe_> because on a pink unicorn, everything is more awesome! [17:24:32] (03CR) 10Volans: "Nice! 
I'm sorry you had to use this instead of the not-yet-developed spinoff of the switchdc mini framework, we should allocate more time " (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/409047 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [17:24:37] I think that it might be due to me being stupid and not seeing something obvious [17:24:48] <_joe_> or that, yes [17:24:48] if $::realm == 'production' and $::hostname != 'cp1008' { [17:24:51] include ::role::ipsec [17:24:54] <_joe_> occam's razor? [17:24:54] } [17:25:05] elukey: is there something like that on misc? [17:25:05] <_joe_> ema: you spoiled it! [17:25:17] as in, does misc include role::ipsec? [17:25:46] uh that was an ugly paste, sorry about that :) [17:25:56] how is this ipsec related? [17:26:11] yeah I was about to ask [17:26:55] that would probably break things, buuut ya probably not problem. hmmm [17:27:43] ohHHHH [17:27:47] because codfw [17:27:48] hmmm [17:27:49] wait, hm [17:27:49] this has to do with kafka_config() [17:28:20] yes [17:28:23] I think it is due to the fact that I'd need to use 'jumbo-eqiad' [17:28:26] hm [17:28:26] yeah [17:28:29] i think you will [17:28:31] that will work [17:28:38] but how does it work on cp1008? [17:29:04] because cp1008 is in eqiad [17:29:10] # Else expect that the caller wants the kafka cluster for prefix in the current datacenter. [17:29:10] else [17:29:10] "#{prefix}-#{site}" [17:29:17] * elukey cries in a corner [17:29:19] kafka_cluster_name.rb [17:29:35] and [17:29:51] 'analytics' cluster name has a stupid special case that we will be glad to get rid of [17:29:58] because historically it is really called 'eqiad' [17:30:04] # There is only one analytics cluster, it lives in eqiad. [17:30:04] # For historical reasons, the name of this cluster is 'eqiad'. 
[17:30:04] elsif prefix == 'analytics' [17:30:04] 'eqiad' [17:30:04] TODO: simplify/destroy kafka_config() and kafka_cluster_name() [17:30:05] (03PS1) 10Elukey: profile::cache::kafka::webrequest::jumbo: fix default kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/409085 (https://phabricator.wikimedia.org/T185136) [17:30:55] (03CR) 10Elukey: [C: 032] profile::cache::kafka::webrequest::jumbo: fix default kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/409085 (https://phabricator.wikimedia.org/T185136) (owner: 10Elukey) [17:32:07] (03PS1) 10Arturo Borrero Gonzalez: apt: fix confirmation prompt of apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/409086 (https://phabricator.wikimedia.org/T181647) [17:33:18] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: fix confirmation prompt of apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/409086 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:33:28] (03PS2) 10Arturo Borrero Gonzalez: apt: fix confirmation prompt of apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/409086 (https://phabricator.wikimedia.org/T181647) [17:33:36] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] apt: fix confirmation prompt of apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/409086 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:34:46] puppet is now running [17:35:51] (03PS5) 10Herron: puppetdb: add major version parameter and add data types [puppet] - 10https://gerrit.wikimedia.org/r/405808 (https://phabricator.wikimedia.org/T185501) [17:36:39] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [17:36:59] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:37:05] elukey: minor nit, you haven't updated the comment about kafka_cluster_name (Default: jumbo). 
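The cluster-name resolution discussed between 17:21 and 17:30 can be restated as a small sketch. This is a Python paraphrase, not the actual kafka_cluster_name.rb: only the 'analytics' special case and the "#{prefix}-#{site}" fallback are quoted in the channel. The pass-through branch for names that already carry a site suffix, the `lookup` helper, and the contents of `KNOWN_CLUSTERS` are assumptions made for illustration.

```python
# Hypothetical restatement of the naming rule quoted from kafka_cluster_name.rb.
# KNOWN_CLUSTERS is illustrative only; it lists just the names seen in the chat.
KNOWN_CLUSTERS = {"jumbo-eqiad", "eqiad"}


def kafka_cluster_name(prefix, site):
    """Resolve a cluster prefix to a full cluster name for the given site."""
    # Assumption: a name that already contains a site suffix is returned as-is.
    if "-" in prefix:
        return prefix
    # Quoted special case: the analytics cluster is historically named 'eqiad'.
    if prefix == "analytics":
        return "eqiad"
    # Quoted fallback: the caller wants the cluster for prefix in the current site.
    return f"{prefix}-{site}"


def lookup(prefix, site, clusters=KNOWN_CLUSTERS):
    """Return the resolved name if such a cluster exists, else None (hypothetical helper)."""
    name = kafka_cluster_name(prefix, site)
    return name if name in clusters else None


# 'jumbo' resolves on an eqiad host (such as cp1008) but not on a codfw host.
eqiad_result = lookup("jumbo", "eqiad")
codfw_result = lookup("jumbo", "codfw")
explicit_result = lookup("jumbo-eqiad", "codfw")
```

Under these assumptions, the bare prefix 'jumbo' becomes 'jumbo-codfw' on cp2006, which matches no cluster, while the explicit 'jumbo-eqiad' resolves from any site. That is consistent with the nil error on codfw hosts and with the fix in change 409085.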
For a future commit :) [17:37:10] <_joe_> /win 24 [17:37:49] ema: ack, thanks! [17:38:09] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:40:50] (03PS6) 10Herron: puppetdb: add support for puppetlabs puppetdb 4.4 package [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) [17:41:46] (03CR) 10jerkins-bot: [V: 04-1] puppetdb: add support for puppetlabs puppetdb 4.4 package [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) (owner: 10Herron) [17:47:49] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:49:49] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [17:49:50] (03PS1) 10Jforrester: Enable the visual diff beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409091 [17:52:37] (03PS1) 10Andrew Bogott: horizon: sudo policy for deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/409093 [17:54:54] (03PS2) 10Andrew Bogott: horizon: sudo policy for deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/409093 [17:55:19] (03CR) 10Hashar: Add support for selective automatic restarts of stateless services after library upgrades (WIP) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:55:31] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=dns400[12].wikimedia.org [17:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:48] (03CR) 10Andrew Bogott: [C: 032] horizon: sudo policy for deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/409093 (owner: 10Andrew Bogott) [17:55:59] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:56] (03PS7) 10Herron: puppetdb: add support for puppetlabs 
puppetdb 4.4 package [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) [17:58:59] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [18:00:05] cscott, arlolra, subbu, halfak, and Amir1: (Dis)respected human, time to deploy Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180208T1800). Please do the needful. [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:14] (03PS1) 10Andrew Bogott: horizon: provide sudo rights to deploy-service, the actual deployer [puppet] - 10https://gerrit.wikimedia.org/r/409095 [18:00:18] No ORES patches. [18:00:45] (03CR) 10Andrew Bogott: [C: 032] horizon: provide sudo rights to deploy-service, the actual deployer [puppet] - 10https://gerrit.wikimedia.org/r/409095 (owner: 10Andrew Bogott) [18:06:09] (03CR) 10Herron: [C: 04-2] "https://puppet-compiler.wmflabs.org/compiler03/9900/" [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) (owner: 10Herron) [18:08:10] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:09:27] (03PS6) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [18:10:06] !log upgrade cp2026 to varnish 5 [18:10:07] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [18:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:27] (03CR) 10Ema: [C: 032] cache_upload: upgrade codfw to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409045 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [18:12:29] (03PS2) 10Ema: cache_upload: upgrade codfw to varnish 
5 [puppet] - 10https://gerrit.wikimedia.org/r/409045 (https://phabricator.wikimedia.org/T180433) [18:12:31] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade codfw to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/409045 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [18:15:15] Did someone deploy something recently? hewiki characters are broken [18:19:21] !log arlolra@tin Started deploy [parsoid/deploy@1367057]: Updating Parsoid to 961a5cf [18:19:26] (03PS7) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [18:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:37] (03CR) 10Volans: [C: 031] "LGTM, optional nitpick inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/405808 (https://phabricator.wikimedia.org/T185501) (owner: 10Herron) [18:19:45] matanya: can you give an example? [18:20:54] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [18:20:58] (03PS1) 10Andrew Bogott: horizon with scap: another try at compressing static content [puppet] - 10https://gerrit.wikimedia.org/r/409100 [18:21:03] jynus: my bad, firefox-i18n file updated and required browser restart [18:21:18] sorry about the false alarm [18:22:12] (03PS8) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [18:22:37] (03PS2) 10Andrew Bogott: horizon with scap: another try at compressing static content [puppet] - 10https://gerrit.wikimedia.org/r/409100 [18:22:48] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) 
(owner: 10Jcrespo) [18:23:10] (03CR) 10Andrew Bogott: [C: 032] horizon with scap: another try at compressing static content [puppet] - 10https://gerrit.wikimedia.org/r/409100 (owner: 10Andrew Bogott) [18:25:23] !log andrew@tin Started deploy [horizon/deploy@9e9d458]: updating with designate dashboard -- take three [18:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:39] !log andrew@tin Finished deploy [horizon/deploy@9e9d458]: updating with designate dashboard -- take three (duration: 01m 16s) [18:26:42] (03PS9) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [18:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:18] (03PS6) 10Herron: puppetdb: add major version parameter and add data types [puppet] - 10https://gerrit.wikimedia.org/r/405808 (https://phabricator.wikimedia.org/T185501) [18:27:32] !log arlolra@tin Finished deploy [parsoid/deploy@1367057]: Updating Parsoid to 961a5cf (duration: 08m 11s) [18:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:10] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [18:29:41] (03CR) 10Herron: [C: 032] puppetdb: add major version parameter and add data types [puppet] - 10https://gerrit.wikimedia.org/r/405808 (https://phabricator.wikimedia.org/T185501) (owner: 10Herron) [18:32:01] (03CR) 10Jcrespo: "Please Manuel spend some time deciphering my changes (I have not yet changed the role name, that is a trivial amend, not definitive). This" [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [18:32:52] (03CR) 10Herron: "Thanks volans! Going to leave the optional bit as-is for now (since it's simply passed on to puppetdb::app). 
it would be good to talk ab" [puppet] - 10https://gerrit.wikimedia.org/r/405808 (https://phabricator.wikimedia.org/T185501) (owner: 10Herron) [18:33:30] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [18:33:35] (03CR) 10Jcrespo: "BTW- regarding the name, I chose an absurd temporary name, because maybe this shouldn't be "dbstore" anymore (but a "provisioning server" " [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [18:33:54] (03PS8) 10Herron: puppetdb: add support for puppetlabs puppetdb 4.4 package [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) [18:37:19] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:35] !log Updated Parsoid to 961a5cf (T186630) [18:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:47] T186630: Cannot read property 'length' of null - https://phabricator.wikimedia.org/T186630 [18:43:20] (03CR) 10Herron: [C: 032] puppetdb: add support for puppetlabs puppetdb 4.4 package [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) (owner: 10Herron) [18:47:49] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:48:49] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [18:48:50] (03PS1) 10BBlack: dns for ulsfo x-vlan LVS [dns] - 10https://gerrit.wikimedia.org/r/409107 [18:49:53] (03PS1) 10Andrew Bogott: keystone: allow api access to labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/409108 (https://phabricator.wikimedia.org/T168470) [18:50:47] (03PS1) 10BBlack: lvs@ulsfo: add public tagged vlans [puppet] - 10https://gerrit.wikimedia.org/r/409109 [18:53:30] (03CR) 10BBlack: [C: 032] dns for ulsfo x-vlan LVS [dns] - 10https://gerrit.wikimedia.org/r/409107 
(owner: 10BBlack) [18:53:51] (03CR) 10Rush: [C: 031] "One comment where I'm not totally sure for ferm but otherise nice and thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/409108 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [18:54:06] 10Operations, 10Puppet: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#3956361 (10herron) [18:54:22] 10Operations, 10Puppet: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#3652246 (10herron) [18:54:24] 10Operations, 10Puppet, 10Patch-For-Review: Add PuppetDB version selector (puppet/hiera) - https://phabricator.wikimedia.org/T185501#3956362 (10herron) 05Open>03Resolved a:03herron [18:54:27] !log bsitzmann@tin Started deploy [mobileapps/deploy@541a7f7]: Update mobileapps to e6fbc94 [18:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:05] !log lvs@ulsfo - puppet disabled, trying tagged vlan deploy [18:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:38] 10Operations, 10Puppet, 10Patch-For-Review: Extend puppetmaster::puppetdb to support puppetlabs packaged puppetdb 4.4 - https://phabricator.wikimedia.org/T185500#3956367 (10herron) [18:55:45] (03CR) 10BBlack: [C: 032] lvs@ulsfo: add public tagged vlans [puppet] - 10https://gerrit.wikimedia.org/r/409109 (owner: 10BBlack) [18:55:49] PROBLEM - Host helium is DOWN: PING CRITICAL - Packet loss = 100% [18:56:46] (03CR) 10Andrew Bogott: [C: 032] keystone: allow api access to labweb hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/409108 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [18:56:50] (03PS2) 10Andrew Bogott: keystone: allow api access to labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/409108 (https://phabricator.wikimedia.org/T168470) [18:57:08] (03PS1) 10Ayounsi: Adding IPs for cr1-codfw <--> cr1-eqsin link [dns] - 10https://gerrit.wikimedia.org/r/409113 
[18:57:49] RECOVERY - Host helium is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [18:58:00] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:58:44] (03PS1) 10Andrew Bogott: Revert "keystone: allow api access to labweb hosts" [puppet] - 10https://gerrit.wikimedia.org/r/409114 [18:58:59] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [18:59:54] (03CR) 10Andrew Bogott: [C: 032] "ferm doesn't like comma-delimited arrays. I don't know how/if this works where it's done elsewhere..." [puppet] - 10https://gerrit.wikimedia.org/r/409114 (owner: 10Andrew Bogott) [19:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180208T1900). [19:00:06] tgr and RoanKattouw: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:13] o/ [19:00:36] !log lvs@ulsfo - all back to normal [19:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:40] o/ [19:02:14] I can do the deploy [19:02:48] !log bsitzmann@tin Finished deploy [mobileapps/deploy@541a7f7]: Update mobileapps to e6fbc94 (duration: 08m 21s) [19:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:59] 10Operations, 10ops-eqiad: OfflineUncorrectableSector on mw1256 sda - https://phabricator.wikimedia.org/T186535#3956376 (10RobH) p:05Triage>03Normal a:03Cmjohnson SAL states depooled & looks like its ready to go, powered down. >>! In T186535#3954891, @Stashbot wrote: > {nav icon=file, name=Mentioned in... 
[19:05:12] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-fgiunchedi: Offline uncorrectable sectors on poolcounter1002 /dev/sda - https://phabricator.wikimedia.org/T186534#3956380 (10RobH) p:05Triage>03Normal a:03fgiunchedi Setting to normal priority and assigned to @fgiunchedi until he merges his change... [19:05:44] 10Operations, 10ops-eqiad: Non-redundant power supply on helium - https://phabricator.wikimedia.org/T186808#3956384 (10Cmjohnson) The PSU was replaced with a spare from a decom'd server. [19:05:53] 10Operations, 10ops-eqiad: Non-redundant power supply on helium - https://phabricator.wikimedia.org/T186808#3956385 (10Cmjohnson) 05Open>03Resolved [19:06:44] 10Operations, 10cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3956391 (10RobH) p:05Triage>03High Changing from 'needs triage' to 'high' priority, since this is tied to quarterly goals. [19:07:35] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3956393 (10RobH) p:05Triage>03Low [19:07:56] 10Operations, 10ops-eqiad: Missing servers in racktables - https://phabricator.wikimedia.org/T186814#3956395 (10RobH) p:05Triage>03High [19:08:10] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:08:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Complete decom process for server caesium - https://phabricator.wikimedia.org/T182805#3956396 (10RobH) a:05Cmjohnson>03RobH [19:09:30] (03PS2) 10Catrope: Enable TemplateStyles extension on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394831 (https://phabricator.wikimedia.org/T176082) (owner: 10Jon Harald Søby) [19:09:34] (03CR) 10Catrope: [C: 032] Enable TemplateStyles extension on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394831 (https://phabricator.wikimedia.org/T176082) (owner: 10Jon Harald Søby) 
[19:10:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Complete decom process for server caesium - https://phabricator.wikimedia.org/T182805#3834960 (10RobH) p:05Triage>03Normal [19:10:33] tgr: "Fix broken returnTo handling" is on mwdebug1002 [19:10:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsoid, 10Patch-For-Review: decom wtp1001-wtp1024 - https://phabricator.wikimedia.org/T177374#3956401 (10RobH) p:05Triage>03Normal [19:11:46] (03PS1) 10Andrew Bogott: keystone: allow api access to labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/409118 (https://phabricator.wikimedia.org/T168470) [19:11:48] (03Merged) 10jenkins-bot: Enable TemplateStyles extension on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394831 (https://phabricator.wikimedia.org/T176082) (owner: 10Jon Harald Søby) [19:12:07] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3956415 (10RobH) p:05Triage>03Normal I'm setting this to normal priority in my dc-ops triaging, as it doesn't seem to be prior... [19:12:09] (03CR) 10jenkins-bot: Enable TemplateStyles extension on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394831 (https://phabricator.wikimedia.org/T176082) (owner: 10Jon Harald Søby) [19:12:17] RoanKattouw: it's an EventLogging fix, not sure how quickly that can be verified [19:12:23] OK [19:12:28] Also scap pull is hanging for some reason [19:12:31] Still hasn't finished [19:12:35] did not break functionality, at least [19:13:03] Oh now it's done [19:13:25] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. 
- https://phabricator.wikimedia.org/T186596#3956420 (10jcrespo) p:05Normal>03High [19:14:36] (03CR) 10Andrew Bogott: [C: 032] keystone: allow api access to labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/409118 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [19:15:23] tgr: Your config patch is also on 1002 now [19:15:30] I'll deploy the EventLogging one [19:16:35] RoanKattouw: the config patch is working [19:17:22] !log catrope@tin Synchronized php-1.31.0-wmf.20/extensions/Campaigns/CampaignsSecondaryAuthenticationProvider.php: T185870 (duration: 01m 13s) [19:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:34] T185870: Monitor registration rates to make sure captcha changes have no negative effects - https://phabricator.wikimedia.org/T185870 [19:17:41] OK cool I'll push that next [19:18:22] (03PS3) 10Dduvall: Add service-checker image used to test service images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220) [19:19:10] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable TemplateStyles on svwiki (T176082) (duration: 01m 11s) [19:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:24] T176082: Deploy TemplateStyles to svwiki - https://phabricator.wikimedia.org/T176082 [19:19:37] the eventlogging one is not collecting any data but I have no idea how close is EL to realtime [19:20:31] (03CR) 10Dduvall: Add service-checker image used to test service images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220) (owner: 10Dduvall) [19:23:02] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. 
- https://phabricator.wikimedia.org/T186596#3956432 (10jcrespo) This is high for us DBAs, not high for dc-ops (but we cannot express that difference). [19:23:49] (03PS1) 10Andrew Bogott: designate: allow labweb hosts access to designate api [puppet] - 10https://gerrit.wikimedia.org/r/409119 (https://phabricator.wikimedia.org/T168470) [19:24:12] !log catrope@tin Synchronized php-1.31.0-wmf.20/extensions/Flow/includes/Model/AbstractRevision.php: T186077 (duration: 01m 11s) [19:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:24] T186077: 'undo' action on SD boards - Fatal exception of type "Wikimedia\Rdbms\DBQueryError" - https://phabricator.wikimedia.org/T186077 [19:26:19] (03CR) 10Dduvall: [C: 04-1] "On second thought, there still seems to be a missing dependency for the `python-service-checker` package." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220) (owner: 10Dduvall) [19:28:11] (03CR) 10Andrew Bogott: [C: 032] designate: allow labweb hosts access to designate api [puppet] - 10https://gerrit.wikimedia.org/r/409119 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [19:30:04] (03PS4) 10Dduvall: Add service-checker image used to test service images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220) [19:30:08] ema: oh, paladox was talking about that being a bug,, that it doesnt find me by username, and the work-around is using the complete email address apparently [19:30:19] Yep [19:30:42] does it affect only me? heh [19:30:46] because i see it autocomplete others [19:30:52] and it worked before upgrade [19:30:54] no_justification hi, when you have a chance could you see if doing the online reindex on mutante account works please? 
[19:31:15] added myself with email [19:31:32] (03CR) 10Dduvall: Add service-checker image used to test service images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220) (owner: 10Dduvall) [19:34:11] reindexing sounds like a possible fix, yep [19:36:03] Yep [19:36:11] paladox: for that external-id's link: firefox: file not found chromium: ERR_INVALID_RESPONSE [19:36:17] https://gerrit.wikimedia.org/r/accounts/dzahn@wikimedia.org/external.ids [19:36:18] curl: not allowed to get external IDs [19:36:19] PROBLEM - Check systemd state on notebook1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:36:22] hmm that should work [19:36:27] mutante or try [19:36:30] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [19:36:31] https://gerrit.wikimedia.org/r/accounts/self/external.ids [19:36:34] (03CR) 10Krinkle: "Yep, we always clear files from current HEAD branch. Both to avoid false positive matches in Git-wide searches, as well as to leave a sing" [software/gdash] - 10https://gerrit.wikimedia.org/r/408787 (https://phabricator.wikimedia.org/T186696) (owner: 10MarcoAurelio) [19:36:46] paladox: with "self" it works and i get a download [19:36:50] Ok [19:36:51] (03CR) 10Krinkle: [V: 032 C: 032] Archive repository [software/gdash] - 10https://gerrit.wikimedia.org/r/408787 (https://phabricator.wikimedia.org/T186696) (owner: 10MarcoAurelio) [19:36:59] pasting content in PM [19:37:04] ok thanks :) [19:37:24] (03CR) 10Krinkle: [V: 032 C: 032] Mark repository as read-only [software/gdash] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/408788 (https://phabricator.wikimedia.org/T186696) (owner: 10MarcoAurelio) [19:38:23] RoanKattouw: oh duh wmf20 has been rolled back right? not much to see there then [19:39:06] so your content looks correct. 
[19:39:16] i think this is the same index bug we were hitting with the accounts [19:39:24] tgr: Yeah well at least in my case I was fixing a wmf.20 blocker :) [19:39:30] which is why upstream decided to change it to read from the db instead of the index. [19:39:54] ssh -p 29418 review.example.com gerrit index start accounts --force [19:40:39] I'll just check the EL data after the train then and fix in the evening swat if there are issues [19:46:01] paladox: Reindexing all accounts at about 100/s [19:46:18] Yep, thanks. [19:46:19] :) [19:46:40] 10Operations, 10ops-eqiad: Missing servers in racktables - https://phabricator.wikimedia.org/T186814#3956461 (10Cmjohnson) Updated racktables for notebook1004 and an1023 is now known as conf1001 wmf4053 [19:46:46] 10Operations, 10ops-eqiad: Missing servers in racktables - https://phabricator.wikimedia.org/T186814#3956462 (10Cmjohnson) 05Open>03Resolved [19:46:50] (03PS4) 10BBlack: ulsfo: use local DNS [puppet] - 10https://gerrit.wikimedia.org/r/409061 [19:47:11] yay that fixed it [19:47:13] no_justification :) [19:47:18] mutante ema ^^ [19:47:20] fixed now [19:47:49] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:54] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3956465 (10Tgr) [19:49:49] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [19:51:13] (03PS5) 10BBlack: ulsfo: use local DNS [puppet] - 10https://gerrit.wikimedia.org/r/409061 [19:51:15] (03PS1) 10BBlack: ulsfo: use local ntp [puppet] - 10https://gerrit.wikimedia.org/r/409125 [19:52:25] (03CR) 10BBlack: [C: 032] ulsfo: use local DNS [puppet] - 10https://gerrit.wikimedia.org/r/409061 (owner: 10BBlack) [19:52:29] (03CR) 10BBlack: [C: 032] ulsfo: use local ntp [puppet] - 10https://gerrit.wikimedia.org/r/409125 (owner: 10BBlack) [19:58:00] 
PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:59] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [20:00:04] no_justification: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180208T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:02:49] Operations, ops-ulsfo, Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3956541 (BBlack) [20:02:52] Operations, ops-ulsfo, Traffic: setup/deploy dns400[12]/wmf721[56] - https://phabricator.wikimedia.org/T179204#3956539 (BBlack) Open>Resolved These are live in service for local NTP+DNS now. [20:03:24] !log gerrit: killed about 12 parallel clones of mediawiki/extensions/Math that had been running between 2-3 days (wtf?) [20:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:49] It wasn't SSH, so....how on earth did it not hit HTTP timeouts? [20:04:09] Or rather probably did, but Jetty didn't let go :\ [20:04:12] * no_justification sighs [20:04:21] Jetty? [20:04:37] go java go! :) [20:04:56] I hate jetty sometimes.
[20:05:02] s/sometimes/all the times/ [20:07:06] bblack: This is what I saw: https://phabricator.wikimedia.org/P6672 (masked the user's name, it's not like it was their fault) [20:09:00] heh [20:12:51] (PS1) Chad: mw.org back to wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/409127 [20:12:53] (PS1) Chad: group1 back to wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/409128 [20:12:55] (PS1) Chad: Group2 to wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/409129 [20:14:19] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:30] PROBLEM - Host mw1256.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:15:32] (PS1) BBlack: dns for eqsin x-vlan LVS [dns] - https://gerrit.wikimedia.org/r/409130 [20:16:21] (CR) BBlack: [C: 2] dns for eqsin x-vlan LVS [dns] - https://gerrit.wikimedia.org/r/409130 (owner: BBlack) [20:20:20] !log starting deploy process to update scb cluster to librdkafka 0.11 and node-rdkafka 2. we will depool, stop puppet, deploy, test, start puppet on each node [20:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:39] RECOVERY - Host mw1256.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.06 ms [20:24:15] !log otto@tin Started deploy [eventstreams/deploy@7629e16]: (no justification provided) [20:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:37] !log otto@tin Finished deploy [eventstreams/deploy@7629e16]: (no justification provided) (duration: 00m 22s) [20:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:10] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:25:51] ottomata: how long you gonna be deploying?
I wanna try rolling the train forward but want less moving parts since it exploded yesterday [20:26:03] !log ppchelko@tin Started deploy [cpjobqueue/deploy@9adaa92]: Update node-rdkafka to 2.0+. Canary [20:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:16] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@9adaa92]: Update node-rdkafka to 2.0+. Canary (duration: 00m 13s) [20:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:58] !log ppchelko@tin Started deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. Canary [20:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:44] !log ppchelko@tin Finished deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. Canary (duration: 00m 46s) [20:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:09] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [20:30:49] (03PS1) 10BBlack: eqsin: switch TFTP to bast5001 [puppet] - 10https://gerrit.wikimedia.org/r/409134 (https://phabricator.wikimedia.org/T156027) [20:30:51] (03PS1) 10BBlack: eqsin: lvs+dns configuration bits [puppet] - 10https://gerrit.wikimedia.org/r/409135 (https://phabricator.wikimedia.org/T156027) [20:31:52] (03CR) 10BBlack: [C: 032] eqsin: switch TFTP to bast5001 [puppet] - 10https://gerrit.wikimedia.org/r/409134 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [20:32:20] (03CR) 10BBlack: [C: 032] eqsin: lvs+dns configuration bits [puppet] - 10https://gerrit.wikimedia.org/r/409135 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [20:33:29] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [20:34:13] !log ppchelko@tin Started restart [changeprop/deploy@5fdc03a]: Restart CP to force rule rebalance [20:34:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:52] (PS1) BBlack: eqsin: bugfix lvs iface list T156027 [puppet] - https://gerrit.wikimedia.org/r/409142 [20:36:16] (CR) BBlack: [C: 2] eqsin: bugfix lvs iface list T156027 [puppet] - https://gerrit.wikimedia.org/r/409142 (owner: BBlack) [20:40:07] Operations, cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3956634 (Dzahn) [20:43:39] Operations, cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3956638 (Krinkle) [20:43:44] Operations, Phabricator, RelEng-Archive-FY201718-Q1: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#3956641 (Dzahn) [20:43:46] Operations, ops-eqiad, hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3956642 (Dzahn) [20:44:29] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:45:11] Operations, cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939825 (Krinkle) [20:47:39] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:49:17] (PS6) Paladox: ircecho: Support auth over irc [puppet] - https://gerrit.wikimedia.org/r/405594 [20:49:20] (CR) Paladox: ircecho: Support auth over irc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/405594 (owner: Paladox) [20:49:39] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [20:49:49] (CR) jerkins-bot: [V: -1] ircecho: Support auth over irc [puppet] - https://gerrit.wikimedia.org/r/405594 (owner: Paladox) [20:50:54] !log andrew@tin Started deploy [horizon/deploy@7d4a2d9]: updating with designate dashboard -- take four [20:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:31] (PS7) Paladox: ircecho: Support auth over irc [puppet] - https://gerrit.wikimedia.org/r/405594 [20:51:42] Operations, cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3956676 (RobH) [20:51:57] (CR) jerkins-bot: [V: -1] ircecho: Support auth over irc [puppet] - https://gerrit.wikimedia.org/r/405594 (owner: Paladox) [20:52:30] !log andrew@tin Finished deploy [horizon/deploy@7d4a2d9]: updating with designate dashboard -- take four (duration: 01m 36s) [20:52:31] (PS8) Paladox: ircecho: Support auth over irc [puppet] - https://gerrit.wikimedia.org/r/405594 [20:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:49] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:49] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [20:54:29] RECOVERY - Check
systemd state on scb2002 is OK: OK - running: The system is fully operational [20:54:55] (03CR) 10Ayounsi: [C: 032] Adding IPs for cr1-codfw <--> cr1-eqsin link [dns] - 10https://gerrit.wikimedia.org/r/409113 (owner: 10Ayounsi) [20:54:58] (03PS2) 10Ayounsi: Adding IPs for cr1-codfw <--> cr1-eqsin link [dns] - 10https://gerrit.wikimedia.org/r/409113 [20:56:50] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:58] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [21:00:31] !log andrew@tin Started deploy [horizon/deploy@5e53829]: updating with designate dashboard -- take five [21:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:56] !log andrew@tin Finished deploy [horizon/deploy@5e53829]: updating with designate dashboard -- take five (duration: 01m 25s) [21:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:24] !log otto@tin Started deploy [eventstreams/deploy@7629e16]: (no justification provided) [21:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:44] !log otto@tin Finished deploy [eventstreams/deploy@7629e16]: (no justification provided) (duration: 00m 21s) [21:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:55] !log ppchelko@tin Started deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. Canary [21:10:58] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:40] !log ppchelko@tin Finished deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. 
Canary (duration: 00m 45s) [21:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:46] !log ppchelko@tin Started deploy [cpjobqueue/deploy@9adaa92]: Update node-rdkafka to 2.0+. Canary [21:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:59] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@9adaa92]: Update node-rdkafka to 2.0+. Canary (duration: 00m 13s) [21:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:58] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [21:15:04] * no_justification sees lots of stuff moving about, goes for coffee instead [21:16:06] no_justification would you like me to file a bug about jetty not timing out (even though we have it set)? [21:17:18] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:21:02] 10Operations, 10cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3956793 (10Dzahn) [21:21:49] paladox: Meh [21:22:12] !log otto@tin Started deploy [eventstreams/deploy@7629e16]: (no justification provided) [21:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:29] !log ppchelko@tin Started deploy [cpjobqueue/deploy@9adaa92]: Update node-rdkafka to 2.0+. 
Canary [21:22:38] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_8092: Servers scb2004.codfw.wmnet are marked down but pooled [21:22:39] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_8092: Servers scb2003.codfw.wmnet are marked down but pooled [21:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:48] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:22:48] PROBLEM - eventstreams on scb2006 is CRITICAL: connect to address 10.192.32.20 and port 8092: Connection refused [21:22:49] "jetty stuck during git-receive-pack" is a pretty hard to diagnose report. [21:22:53] Could be a bajillion causes [21:22:53] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@9adaa92]: Update node-rdkafka to 2.0+. Canary (duration: 00m 25s) [21:22:54] hmm, we should probably silience some alarms here [21:23:03] sorry, will be done in a sec in codfw, we will silence in eqiad [21:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:06] !log ppchelko@tin Started deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. Canary [21:23:08] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:23:15] !log otto@tin Finished deploy [eventstreams/deploy@7629e16]: (no justification provided) (duration: 01m 03s) [21:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:18] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:38] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:23:48] RECOVERY - eventstreams on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 952 bytes in 0.098 second response time [21:23:59] !log ppchelko@tin Finished deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. Canary (duration: 00m 54s) [21:24:08] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [21:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:18] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [21:24:38] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [21:24:38] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [21:24:48] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [21:24:48] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational [21:25:38] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [21:25:39] PROBLEM - Check whether ferm is active by checking the default input chain on labtestservices2003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [21:26:15] 10Operations, 10cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3956816 (10Dzahn) [21:26:18] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [21:28:42] no_justification https://bugs.chromium.org/p/gerrit/issues/detail?id=8336 [21:28:48] PROBLEM - SSH on labtestservices2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:29:28] 
PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:29:32] paladox: Almost nothing from that is actionable. It's about as vague as "reading pages doesn't work" in MW :p [21:29:40] heh [21:29:50] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [21:29:50] RECOVERY - Check whether ferm is active by checking the default input chain on labtestservices2003 is OK: OK ferm input default policy is set [21:30:08] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 1289 days) [21:30:10] they are still using jetty 9.3 [21:30:18] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [21:30:33] looking at the recent change log for 9.3.23 there's no fixes specifically for timeout's not being enforced [21:30:39] RECOVERY - SSH on labtestservices2002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [21:32:50] !log restarted rsyslogd services on lithium and wezen to clear rsyslog tls listener on port 6514 icinga alerts [21:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:10] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3931188 (10Dzahn) 05Open>03Resolved a:03Dzahn [21:34:01] elukey: should archiva stay stopped for now? [21:37:39] no_justification i wonder, is it because we are proxying apache to jetty. [21:37:48] so jetty may have killed it, but apache may not have? 
[21:38:03] I think it'd be the opposite [21:38:09] Apache hung up, but Jetty didn't realize it [21:38:11] So it kept going [21:38:16] oh [21:38:24] !log otto@tin Started deploy [eventstreams/deploy@7629e16]: upgrade to librdkafka 0.11 node-rdkafka 2 [21:38:34] !log ppchelko@tin Started deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. Canary [21:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:48] !log otto@tin Finished deploy [eventstreams/deploy@7629e16]: upgrade to librdkafka 0.11 node-rdkafka 2 (duration: 00m 24s) [21:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:04] ottomata: How much longer is this deploy going to last? (not rushing you, but need to know for train) [21:39:20] !log ppchelko@tin Finished deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. Canary (duration: 00m 47s) [21:39:29] hm, max 30 mins, probably more like 15 no_justification [21:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:36] we are talking in wikimedia-services if you want to join and follow [21:40:12] !log ppchelko@tin Started deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. Canary [21:40:15] Meh, train will have to extend its window then [21:40:16] !log ppchelko@tin Finished deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. Canary (duration: 00m 04s) [21:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:31] I didn't know this got scheduled alongside the thurs train [21:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:35] !log ppchelko@tin Started deploy [cpjobqueue/deploy@9adaa92]: Update node-rdkafka to 2.0+. 
Canary [21:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:50] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@9adaa92]: Update node-rdkafka to 2.0+. Canary (duration: 00m 15s) [21:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:14] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3956869 (10Dzahn) a:05Dzahn>03elukey Icinga for meitnerium looks fine. no disk space warnings. Though one thing: puppet is still disabled there... should it? [21:42:20] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3956871 (10Dzahn) 05Resolved>03Open [21:45:42] Please clean all content from my zoranzoki21@tools-bastion-03:~$ vps [21:45:43] thanks [21:46:14] I have problems with removing directories [21:46:58] (03PS1) 10Gergő Tisza: Stop PHP errors from going to the hhvm channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409171 (https://phabricator.wikimedia.org/T45086) [21:47:24] !log otto@tin Started deploy [eventstreams/deploy@7629e16]: upgrade to librdkafka 0.11 node-rdkafka 2 [21:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:09] !log otto@tin Finished deploy [eventstreams/deploy@7629e16]: upgrade to librdkafka 0.11 node-rdkafka 2 (duration: 00m 46s) [21:48:11] !log ppchelko@tin Started deploy [cpjobqueue/deploy@9adaa92]: Update node-rdkafka to 2.0+. 
Finally we're there [21:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:31] (03PS2) 10Gergő Tisza: Stop PHP errors from going to the hhvm channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409171 (https://phabricator.wikimedia.org/T45086) [21:48:46] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@9adaa92]: Update node-rdkafka to 2.0+. Finally we're there (duration: 00m 35s) [21:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:00] !log ppchelko@tin Started deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. Finally we're there [21:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:49] !log ppchelko@tin Finished deploy [changeprop/deploy@5fdc03a]: Update node-rdkafka to 2.0+. Finally we're there (duration: 00m 49s) [21:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:05] no_justification: i think we done, Pchelolo checking up that the changeprop stuff is fine, but the deploy is finished [21:53:22] !log finished upgrade of scb to librdkafka 0.11 and node-rdkafka 2 [21:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:27] no_justification: is there no train today? [21:56:43] Zoranzoki21: wrong channel is on topic for the #wikimedia-cloud channel, not here. [21:56:50] There is. I was waiting for the other deployment to finish [21:56:58] tgr: ^^^ [21:57:02] greg-g: I resolved what I wanted [21:57:05] already [21:57:39] that's fine, but do you understand the information I just gave you? [21:57:42] Zoranzoki21: ^ [21:57:52] greg-g: yes [21:58:46] !log andrew@tin Started deploy [horizon/deploy@5e53829]: updating with designate dashboard -- take... six, I guess? 
[21:58:49] !log andrew@tin Finished deploy [horizon/deploy@5e53829]: updating with designate dashboard -- take... six, I guess? (duration: 00m 03s) [21:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:50] 10Operations, 10cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3956904 (10Dzahn) [22:08:16] !log rebooting cr1-eqsin [22:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:45] (03CR) 10星耀晨曦: "Have anyone review this patch?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406487 (https://phabricator.wikimedia.org/T184866) (owner: 10星耀晨曦) [22:12:06] (03CR) 10Zoranzoki21: "Please rebase patch and then I will give vote. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406487 (https://phabricator.wikimedia.org/T184866) (owner: 10星耀晨曦) [22:12:18] PROBLEM - DPKG on labtestneutron2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:12:36] (03CR) 10Chad: [C: 032] mw.org back to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409127 (owner: 10Chad) [22:13:18] RECOVERY - DPKG on labtestneutron2001 is OK: All packages OK [22:15:15] (03Merged) 10jenkins-bot: mw.org back to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409127 (owner: 10Chad) [22:17:12] (03CR) 10jenkins-bot: mw.org back to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409127 (owner: 10Chad) [22:17:14] (03PS5) 10星耀晨曦: Set Portal and Portal talk namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406487 (https://phabricator.wikimedia.org/T184866) [22:17:46] !log demon@tin rebuilt and synchronized wikiversions files: mw.org back to wmf.20 [22:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:10] (03CR) 10Zoranzoki21: [C: 031] "Looks good" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/406487 (https://phabricator.wikimedia.org/T184866) (owner: 10星耀晨曦) [22:20:04] (03PS3) 10Gergő Tisza: Stop PHP errors from going to the hhvm channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409171 (https://phabricator.wikimedia.org/T45086) [22:20:44] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3956940 (10RobH) 05Open>03Resolved [22:26:27] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3956947 (10chasemp) The original request was: > Disks: 8T after RAID1 with a hardware raid controller in /T162486 and I see these with: > /dev/mapper... [22:27:29] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 [22:29:28] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0 [22:29:56] (03CR) 10Ayounsi: LibreNMS: Allow librenms to write file in $install_dir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399101 (owner: 10Ayounsi) [22:30:28] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 [22:30:38] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [22:32:06] !log bsitzmann@tin Started deploy [mobileapps/deploy@75a2ebb]: Update mobileapps to e93ab95 [22:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:13] !log bsitzmann@tin Finished deploy [mobileapps/deploy@75a2ebb]: Update mobileapps to e93ab95 (duration: 05m 07s) [22:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:26] 10Operations, 
10ops-eqiad, 10hardware-requests: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3956979 (10RobH) a:03RobH [22:54:04] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3956981 (10RobH) [22:55:28] (03PS1) 10RobH: stat1003 decom [puppet] - 10https://gerrit.wikimedia.org/r/409180 (https://phabricator.wikimedia.org/T175150) [22:57:13] (03PS1) 10RobH: decom stat1003 prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/409181 (https://phabricator.wikimedia.org/T175150) [22:57:18] (03CR) 10RobH: [C: 032] stat1003 decom [puppet] - 10https://gerrit.wikimedia.org/r/409180 (https://phabricator.wikimedia.org/T175150) (owner: 10RobH) [22:57:56] (03CR) 10RobH: [C: 032] decom stat1003 prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/409181 (https://phabricator.wikimedia.org/T175150) (owner: 10RobH) [22:59:24] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3956990 (10RobH) [22:59:42] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584101 (10RobH) a:05RobH>03Cmjohnson [23:05:28] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:05:48] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:05:48] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:05:49] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:06:09] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:06:09] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:06:28] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:20:46] (03PS2) 10Krinkle: Remove redundant 
wgTemplateSandboxEditNamespaces addition [mediawiki-config] - https://gerrit.wikimedia.org/r/363531 (owner: Legoktm) [23:23:36] robh: Hm.. stat1005 issues coincidence? [23:24:58] PROBLEM - IPMI Sensor Status on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:25:29] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [23:25:48] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [23:25:48] RECOVERY - DPKG on stat1005 is OK: All packages OK [23:25:58] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [23:26:09] seems so [23:26:18] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [23:26:21] sorry, i was swapping laundry [23:26:25] didnt hear the ping [23:26:28] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [23:33:54] !log ppchelko@tin Started deploy [restbase/deploy@c0f0dcd]: Fix a type that prevented the mobile partial content to have an etag [23:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:10] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Thu 2018-02-08 23:36:06 UTC. [23:40:18] Operations, cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939825 (bd808) >>! In T186288#3956391, @RobH wrote: > Changing from 'needs triage' to 'high' priority, since this is tied to quarterly goals. Who's goal and which qua... [23:40:50] no_justification: how's it going? [23:40:55] 20 minutes until swat [23:41:28] (CR) Dzahn: [C: 2] "eh yea, that "tail" was just there to give some active feedback to the user when Icinga was actually done and scheduled the downtime.
so i" [puppet] - https://gerrit.wikimedia.org/r/408989 (https://phabricator.wikimedia.org/T145192) (owner: Ema) [23:42:15] (PS2) Dzahn: icinga-downtime: do not wait for two log lines [puppet] - https://gerrit.wikimedia.org/r/408989 (https://phabricator.wikimedia.org/T145192) (owner: Ema) [23:44:07] Operations, cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3957049 (RobH) I thought it was an ongoing goal to replace them, perhaps its just a non goal but roadmapped item. I'm sure we'll be able to find out in sre weekly meet... [23:48:06] robh: bd808 not mentioned in q3, at least: https://www.mediawiki.org/wiki/Wikimedia_Technology/Goals/2017-18_Q3 [23:48:27] i already said i was wrong =p [23:48:39] greg-g: I'm cancelling swat [23:48:40] and i got like 3 pms saying i was wrong [23:48:43] It was empty last I saw it [23:48:48] so we can all stop pointing out that i was wrong ;] [23:49:02] robh: :( sorry. I didn't mean to be part of a pile on [23:49:24] i could have sworn it was but it wasnt a goal [23:49:31] i guess it was more of a 'this needs to happen badly' [23:49:38] !log ppchelko@tin Finished deploy [restbase/deploy@c0f0dcd]: Fix a type that prevented the mobile partial content to have an etag (duration: 15m 44s) [23:49:40] and now its 'how the hell has this not already happened' [23:49:40] no_justification: word [23:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:57] bd808: no worries, the ask on task was legit im not mad at it [23:49:58] robh: lol :) sorry! was just trying to help! :P [23:50:58] my "ambitious" goal is to have Trusty gone from Cloud Services by the end of Q2 FY18/19 ;) [23:51:13] i also thought it was.. google doc with lists of suggestions for goals maybe [23:51:27] maybe it was on next years goals?
[23:51:36] or just suggestion yeah [23:51:37] dunno [23:51:37] i think it came up in planning the next ones [23:51:41] either way needs to happen! [23:51:42] in some way [23:52:21] also, do people use those "goal" tags in phab or nah [23:52:39] maybe that used to be in previous years [23:53:05] elukey: *cough* stat1005.... [23:53:31] of course it is 2 am here 1 am there so odds are he won't see that til tomorrow [23:53:36] apergos: it alerted and then cleared [23:53:51] around when i was decoming stat1003 so was a bit alarming for a moment [23:53:58] it's not the alert but the oom -> chroot failing (still) [23:54:08] ah [23:54:10] there's something he restarts but dang if i remember at 2am [23:54:20] it will keep [23:54:37] mutante: not really, I have my own #releng-fy1718-q3 tags and then some use of #epics [23:55:00] RECOVERY - IPMI Sensor Status on stat1005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [23:55:12] greg-g: ok, thanks. #epic works too :p [23:55:57] oh, stat1005 was too warm. that's what it was? [23:56:08] It must've grown attached to stat1003. [23:56:19] And overheated when it was decom'ed [23:56:22] got a fever [23:56:24] :-/ [23:56:38] they know when you come for their friends [23:57:25] (CR) Krinkle: [C: 1] Stop PHP errors from going to the hhvm channel [mediawiki-config] - https://gerrit.wikimedia.org/r/409171 (https://phabricator.wikimedia.org/T45086) (owner: Gergő Tisza) [23:57:29] (ezachte) CMD (/home/ezachte/wikistats/dumps/bash/progress_wikistats.sh [23:58:26] it seems to be done