[00:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171109T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:14:12] SWAT done so.
[00:15:47] dereckson: :)
[00:30:53] PROBLEM - Long running screen/tmux on puppetcompiler1001 is CRITICAL: CRIT: Long running SCREEN process. (PID: 12278, 1741188s 1728000s).
[00:51:57] when your hammer is uploadwizard, everything starts looking like a thumbnail. good night
[00:52:38] Operations, Analytics, DBA, User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3746363 (Tgr) Checking how much different user cohorts (bucketed by editcount) click on a button, for example. If it can be done in Hadoop, that sounds gre...
[00:52:52] apergos: :D Night!
[00:53:54] :o
[01:01:02] twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171109T0100).
[01:01:04] No GERRIT patches in the queue for this window AFAICS.
[01:01:25] jouncebot: no phabricator update tonight.
[01:02:02] !log Not deploying any phabricator updates this week.
[01:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:15] Operations, Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2346094 (Legoktm) @kaythaney where should one report bugs in the website? I noticed that the wrong MediaWiki logo is being used.
[01:32:32] (03PS1) 10Thcipriani: Add Jinja2 expression statement [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/390174 [01:33:50] (03CR) 10Thcipriani: "For more info, see code review for Ia018a0d7681ff5cdb49134464e97c3b7c210cf50" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/390174 (owner: 10Thcipriani) [01:35:19] (03PS1) 10TerraCodes: git.wikimedia.org -> phab [debs/logster] - 10https://gerrit.wikimedia.org/r/390175 (https://phabricator.wikimedia.org/T139089) [02:21:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3746453 (10bd808) [02:27:04] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2091705 [02:29:05] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.6) (duration: 09m 18s) [02:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:31] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.7) (duration: 09m 09s) [02:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:58:30] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Nov 9 02:58:30 UTC 2017 (duration 6m 59s) [02:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:10] (03PS11) 10TerraCodes: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) [03:04:40] (03CR) 10TerraCodes: [C: 031] git.wikimedia.org -> phab [software/swift-utils] - 10https://gerrit.wikimedia.org/r/390026 (https://phabricator.wikimedia.org/T139089) (owner: 10TerraCodes) [03:29:02] 10Operations, 10Parsoid, 10Traffic, 10VisualEditor, 10HTTPS: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3746533 (10Arlolra) Parsoid seems to be configured correctly since 
https://wiki.dronelaws.io:8000/localhost/v3/page/html/Main_Page/2 renders just fin... [03:30:54] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 883.69 seconds [03:54:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 270.56 seconds [03:54:49] (03Draft2) 10Jayprakash12345: Enable the SandboxLink extension in the Mirandese Wikipedia (Third Req) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390182 [03:55:29] (03PS3) 10Jayprakash12345: Enable the SandboxLink extension in the Mirandese Wikipedia (Third Req) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390182 (https://phabricator.wikimedia.org/T180052) [04:19:22] (03Draft2) 10Jayprakash12345: Add BP and WP as aliases to project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390183 [04:20:40] (03PS3) 10Jayprakash12345: Add BP and WP as aliases to project namespace at mwlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390183 (https://phabricator.wikimedia.org/T180052) [04:56:38] (03CR) 10Krinkle: [C: 031] Remove chromium module [puppet] - 10https://gerrit.wikimedia.org/r/389971 (https://phabricator.wikimedia.org/T175093) (owner: 10Alexandros Kosiaris) [04:58:32] 10Operations, 10ops-eqiad, 10hardware-requests, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3746569 (10Krinkle) >>! In T175093#3745028, @RobH wrote: > [..]. I did not touch them, as they are scripts and the hostname reference may just be cosmetic... 
[05:10:54] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 61848 MB (12% inode=99%)
[06:15:07] (PS1) TerraCodes: Adjust throttle.php [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046)
[06:28:32] (PS2) Krinkle: Adjust throttle.php for dewiki workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: TerraCodes)
[06:30:02] (CR) jerkins-bot: [V: -1] Adjust throttle.php for dewiki workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: TerraCodes)
[06:42:39] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3746654 (Marostegui) >>! In T173570#3745891, @Cmjohnson wrote: > @marostegui during my decom checks I found db1015 in this file. Should a replacement be identified...
[06:46:10] (PS1) Marostegui: s1,s2,s5.hosts: Move db1105 to s1 and s2 [software] - https://gerrit.wikimedia.org/r/390192 (https://phabricator.wikimedia.org/T178359)
[06:51:14] RECOVERY - Disk space on elastic1025 is OK: DISK OK
[07:01:03] (PS2) Dzahn: phabricator: drop ferm rule to open port 443 [puppet] - https://gerrit.wikimedia.org/r/389457
[07:01:37] (CR) Dzahn: [C: 2] phabricator: drop ferm rule to open port 443 [puppet] - https://gerrit.wikimedia.org/r/389457 (owner: Dzahn)
[07:02:54] (CR) Dzahn: "applied on phab1001 - phabricator works as before" [puppet] - https://gerrit.wikimedia.org/r/389457 (owner: Dzahn)
[07:03:29] (PS2) Dzahn: phabricator: limit http access to cache_misc [puppet] - https://gerrit.wikimedia.org/r/389459
[07:04:40] !log legoktm@tin Synchronized php-1.31.0-wmf.7/resources/: Restore jquery.badge and jquery.placeholder modules (duration: 00m 53s)
[07:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:55] (CR) Dzahn: [C: 2] phabricator: limit http access to cache_misc [puppet] - https://gerrit.wikimedia.org/r/389459 (owner: Dzahn)
[07:07:38] (CR) Dzahn: "applied on phab1001 - phabricator works as before" [puppet] - https://gerrit.wikimedia.org/r/389459 (owner: Dzahn)
[07:10:06] Operations, monitoring, Patch-For-Review: ensure that services on labtest machines never create SMS from Icinga (not send sms pages for labtest* things to non-cloud folks) - https://phabricator.wikimedia.org/T178008#3746656 (Dzahn) Could i please get reviews on https://gerrit.wikimedia.org/r/#/q/topi...
[07:19:59] (03PS2) 10Dzahn: ci: restrict http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389432 [07:20:12] (03CR) 10Dzahn: ci: restrict http access to cache_misc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389432 (owner: 10Dzahn) [07:24:27] (03PS1) 10Dzahn: rm modules/role/manifests/requesttracker/upgradetest.pp [puppet] - 10https://gerrit.wikimedia.org/r/390199 [07:25:57] (03CR) 10Dzahn: [C: 032] rm modules/role/manifests/requesttracker/upgradetest.pp [puppet] - 10https://gerrit.wikimedia.org/r/390199 (owner: 10Dzahn) [07:28:34] 10Operations, 10Parsoid, 10VisualEditor, 10HTTPS: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3746663 (10ema) [07:29:43] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 3 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875#3746666 (10ema) p:05Triage>03Normal [07:29:59] (03PS1) 10Dzahn: smokeping: restrict http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/390203 [07:30:01] 10Operations, 10Traffic: LVS IPv6 IPs should all be recorded in DNS - https://phabricator.wikimedia.org/T179026#3746667 (10ema) p:05Triage>03Normal [07:31:27] (03PS2) 10Dzahn: smokeping: restrict http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/390203 [07:35:33] 10Operations, 10PAWS, 10Pywikibot-Commons, 10Traffic, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3746668 (10ema) p:05Triage>03Normal [07:36:02] (03PS1) 10Dzahn: noc: restrict http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/390205 [07:38:26] 10Operations, 10PAWS, 10Pywikibot-Commons, 10Traffic, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3696395 (10ema) Anything else left to do here? 
Is the problem solved for you @Chicocvenancio? [07:40:04] 10Operations, 10MediaWiki-Authentication-and-authorization, 10Traffic, 10Security-Core: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604#3746674 (10ema) p:05Triage>03Normal [07:44:01] (03CR) 10Marostegui: [C: 032] s1,s2,s5.hosts: Move db1105 to s1 and s2 [software] - 10https://gerrit.wikimedia.org/r/390192 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [07:44:46] (03Merged) 10jenkins-bot: s1,s2,s5.hosts: Move db1105 to s1 and s2 [software] - 10https://gerrit.wikimedia.org/r/390192 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [07:46:53] <_joe_> !log restarting apache on rhodium after setting --profile --trace in the puppet settings [07:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:56] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1074_v4, cp1074_v6 [07:54:05] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 66 connecting: cp1074_v6 not-conn: cp1074_v4 [07:54:05] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:05] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1074_v4, cp1074_v6 [07:54:05] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:06] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:06] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:06] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:07] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1074_v4, cp1074_v6 [07:54:07] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 
[07:54:15] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1074_v4, cp1074_v6 [07:54:15] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:16] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:16] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:35] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:35] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:36] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1074_v4, cp1074_v6 [07:54:45] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1074_v4, cp1074_v6 [07:54:45] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1074_v4, cp1074_v6 [07:54:46] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:46] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:55] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:56] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:54:56] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1074_v4, cp1074_v6 [07:55:11] looking ^ [07:56:15] PROBLEM - Host cp1074 is DOWN: PING CRITICAL - Packet loss = 100% [07:56:51] !log cp1074 failed rebooting, power-cycled [07:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:36] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 68 ESP OK [07:58:45] RECOVERY - Host cp1074 is UP: PING WARNING - Packet loss = 58%, RTA = 0.18 ms [07:58:45] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 
68 ESP OK [07:58:45] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 68 ESP OK [07:58:46] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 54 ESP OK [07:58:46] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [07:59:05] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 54 ESP OK [07:59:05] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [07:59:05] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [07:59:05] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 68 ESP OK [07:59:06] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 68 ESP OK [07:59:06] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [07:59:06] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 68 ESP OK [07:59:15] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [07:59:15] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [07:59:15] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 68 ESP OK [07:59:15] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [07:59:15] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [07:59:16] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 54 ESP OK [07:59:16] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 68 ESP OK [07:59:25] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 54 ESP OK [07:59:25] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 54 ESP OK [07:59:25] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [07:59:35] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [07:59:36] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [08:03:04] (03CR) 10Hashar: [C: 031] "Seems good :] It used to be reacheable directly by the server public IP, but that is effectively entirely behind the misc cache nowadays" [puppet] - 10https://gerrit.wikimedia.org/r/389432 (owner: 10Dzahn) [08:13:14] (03PS3) 10Dzahn: ci: restrict http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389432 [08:14:37] (03CR) 10Dzahn: [C: 032] ci: restrict http access to 
cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389432 (owner: 10Dzahn) [08:16:21] (03CR) 10Dzahn: "applied on contint1001 - https://integration.wikimedia.org/ and https://doc.wikimedia.org works as before" [puppet] - 10https://gerrit.wikimedia.org/r/389432 (owner: 10Dzahn) [08:17:54] (03PS3) 10Dzahn: smokeping: restrict http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/390203 [08:21:29] (03CR) 10Dzahn: [C: 032] smokeping: restrict http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/390203 (owner: 10Dzahn) [08:22:50] (03CR) 10Dzahn: "applied on netmon2001/netmon1002. https://smokeping.wikimedia.org works as before" [puppet] - 10https://gerrit.wikimedia.org/r/390203 (owner: 10Dzahn) [08:24:17] (03PS2) 10Dzahn: noc: restrict http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/390205 [08:24:46] (03CR) 10Dzahn: "also behind misc-web, active/active terbium/wasat mw maintenance servers" [puppet] - 10https://gerrit.wikimedia.org/r/390205 (owner: 10Dzahn) [08:25:51] (03CR) 10Dzahn: [C: 032] noc: restrict http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/390205 (owner: 10Dzahn) [08:27:06] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0 [08:27:55] (03CR) 10Dzahn: "applied on terbium and wasat - https://noc.wikimedia.org works normal" [puppet] - 10https://gerrit.wikimedia.org/r/390205 (owner: 10Dzahn) [08:41:42] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] git.wikimedia.org -> phab [software/swift-utils] - 10https://gerrit.wikimedia.org/r/390026 (https://phabricator.wikimedia.org/T139089) (owner: 10TerraCodes) [08:53:31] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099#3746748 (10Joe) I was able to extract a semi-meaningful backtrace from rhodium: ``` Nov 9 08:44:22 rhodium puppet-master[3889]: undefined me... 
[08:57:18] (03PS6) 10Gehel: apt: purge unmanaged sources.list [puppet] - 10https://gerrit.wikimedia.org/r/389942 [09:15:26] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: Allow Kirk and Martijn (JClarity) access to our WDQS production servers - https://phabricator.wikimedia.org/T178271#3746818 (10Dzahn) [09:35:56] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 21 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:38:36] 10Operations, 10Patch-For-Review, 10User-Urbanecm, 10Wiki-Setup (Create): Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3746860 (10Dzahn) [09:40:56] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 10 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:46:55] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [09:49:10] (03CR) 10Alexandros Kosiaris: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/8701/elastic1020.eqiad.wmnet/ says ok as well, I guess this is good to go" [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [09:49:38] akosiaris, moritzm: thanks! [09:50:38] !log rolling reboot of scb in eqiad for kernel update (also to pick up openssl updates) [09:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:14] (03PS7) 10Gehel: apt: purge unmanaged sources.list [puppet] - 10https://gerrit.wikimedia.org/r/389942 [09:54:28] (03CR) 10Gehel: [C: 032] apt: purge unmanaged sources.list [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [09:56:04] (03CR) 10Alexandros Kosiaris: "I 've responded on https://phabricator.wikimedia.org/P6286 with deployment-prep's sources.list. 
Not very consistent but we should aim to f" [puppet] - https://gerrit.wikimedia.org/r/389942 (owner: Gehel)
[09:59:35] RECOVERY - puppet last run on puppetcompiler1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:04:12] Operations, Performance-Team: setup/install lawrencium for temp use by performance team - https://phabricator.wikimedia.org/T179968#3746989 (Gilles) lawrencium.eqiad.wmnet prompts me for a password, I imagine I don't have SSH access to it? The perf-team shell group would be the correct one to use here.
[10:09:07] (PS1) Muehlenhoff: Grant perf-team access to lawrencium [puppet] - https://gerrit.wikimedia.org/r/390214 (https://phabricator.wikimedia.org/T179968)
[10:22:14] (PS1) Dzahn: shell access for perf-team to lawrencium [puppet] - https://gerrit.wikimedia.org/r/390217 (https://phabricator.wikimedia.org/T179968)
[10:26:23] (Abandoned) Dzahn: shell access for perf-team to lawrencium [puppet] - https://gerrit.wikimedia.org/r/390217 (https://phabricator.wikimedia.org/T179968) (owner: Dzahn)
[10:26:35] PROBLEM - puppet last run on puppetcompiler1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [10:27:07] (03PS2) 10Dzahn: Grant perf-team access to lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/390214 (https://phabricator.wikimedia.org/T179968) (owner: 10Muehlenhoff) [10:27:39] (03CR) 10Dzahn: [C: 032] Grant perf-team access to lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/390214 (https://phabricator.wikimedia.org/T179968) (owner: 10Muehlenhoff) [10:29:21] 10Operations, 10Performance-Team, 10Patch-For-Review: setup/install lawrencium for temp use by performance team - https://phabricator.wikimedia.org/T179968#3742297 (10Dzahn) @gilles it should work now ``` [lawrencium:~] $ id gilles uid=4319(gilles) gid=500(wikidev) groups=500(wikidev),796(perf-team) ``` [10:32:42] (03PS1) 10Dzahn: lawrencium: role spare -> role test [puppet] - 10https://gerrit.wikimedia.org/r/390218 (https://phabricator.wikimedia.org/T179968) [10:33:20] (03CR) 10Muehlenhoff: [C: 031] lawrencium: role spare -> role test [puppet] - 10https://gerrit.wikimedia.org/r/390218 (https://phabricator.wikimedia.org/T179968) (owner: 10Dzahn) [10:33:46] (03CR) 10Dzahn: [C: 032] lawrencium: role spare -> role test [puppet] - 10https://gerrit.wikimedia.org/r/390218 (https://phabricator.wikimedia.org/T179968) (owner: 10Dzahn) [10:39:15] (03CR) 10Alexandros Kosiaris: [C: 032] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/389971 (https://phabricator.wikimedia.org/T175093) (owner: 10Alexandros Kosiaris) [10:39:20] (03PS2) 10Alexandros Kosiaris: Remove chromium module [puppet] - 10https://gerrit.wikimedia.org/r/389971 (https://phabricator.wikimedia.org/T175093) [10:39:22] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove chromium module [puppet] - 10https://gerrit.wikimedia.org/r/389971 (https://phabricator.wikimedia.org/T175093) (owner: 10Alexandros Kosiaris) [10:53:33] 10Operations, 10Services, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051#3747151 (10dcausse) I think we should introduce a pattern where log emitters can freely send large and complex obj... [10:54:53] (03CR) 10Alexandros Kosiaris: [C: 031] "Let me know if you want to be present when I merge this" [puppet] - 10https://gerrit.wikimedia.org/r/389550 (owner: 10EBernhardson) [11:03:58] !log rebooting mw1276-mw1279 (API canaries) to 4.9.5 (also to pick up new OpenSSL) [11:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:38] RECOVERY - puppet last run on puppetcompiler1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [11:06:45] (03PS4) 10Filippo Giunchedi: role: Prometheus https access to k8s apiserver / node [puppet] - 10https://gerrit.wikimedia.org/r/389929 (https://phabricator.wikimedia.org/T177395) [11:07:34] (03CR) 10Filippo Giunchedi: [C: 032] role: Prometheus https access to k8s apiserver / node [puppet] - 10https://gerrit.wikimedia.org/r/389929 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [11:08:18] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:18] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:18] PROBLEM - IPsec on kafka1013 is 
CRITICAL: Strongswan CRITICAL - ok: 110 not-conn: cp2008_v4, cp2008_v6 [11:08:19] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:19] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:19] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:19] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:28] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:28] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:28] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:28] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:28] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:28] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:29] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:29] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:30] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:38] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:40] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 110 not-conn: cp2008_v4, cp2008_v6 [11:08:41] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:45] looking [11:08:48] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 110 not-conn: cp2008_v4, cp2008_v6 [11:08:49] PROBLEM - IPsec on 
cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:49] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:58] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 110 not-conn: cp2008_v4, cp2008_v6 [11:08:58] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:58] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:08:59] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:09:09] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 110 not-conn: cp2008_v4, cp2008_v6 [11:09:09] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 110 not-conn: cp2008_v4, cp2008_v6 [11:09:09] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:09:09] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:09:18] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:09:18] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:09:18] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:09:18] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [11:09:41] console output is not particularly useful [11:09:42] [1318520.125840] [11:10:26] !log powercycle cp2008, stuck rebooting [11:10:27] it's like "Lost", those numbers may have a meaning [11:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:38] PROBLEM - Host cp2008 is DOWN: PING CRITICAL - Packet loss = 100% [11:12:54] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 112 ESP OK [11:12:54] RECOVERY - Host cp2008 is UP: PING 
OK - Packet loss = 0%, RTA = 36.03 ms [11:12:55] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [11:12:55] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [11:12:55] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 112 ESP OK [11:13:04] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 54 ESP OK [11:13:04] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 54 ESP OK [11:13:05] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [11:13:14] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 112 ESP OK [11:13:15] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 112 ESP OK [11:13:15] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 54 ESP OK [11:13:15] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 54 ESP OK [11:13:15] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 54 ESP OK [11:13:15] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [11:13:34] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 54 ESP OK [11:13:34] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 54 ESP OK [11:13:34] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 54 ESP OK [11:13:44] PROBLEM - puppet last run on puppetcompiler1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:13:44] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 112 ESP OK [11:13:44] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [11:13:45] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [11:13:45] PROBLEM - puppetmaster backend https on rhodium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error [11:15:04] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 54 ESP OK [11:15:45] RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 331 bytes in 0.061 second response time [11:16:45] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [11:16:45] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [11:18:25] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 54 ESP OK [11:18:44] RECOVERY - puppet last run on puppetcompiler1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:20:15] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 54 ESP OK [11:21:55] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 54 ESP OK [11:23:44] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1063_v4, cp1063_v6 [11:24:44] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 68 ESP OK [11:25:04] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [11:25:04] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [11:25:25] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [11:25:44] PROBLEM - puppet last run on puppetcompiler1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:25:48] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099#3747304 (10Joe) Mistery solved: in the method `@ssh_host.certificate` calls, that is `Puppet::SSL::Host.certificate`, we have ``` return... [11:26:18] 10Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#3747306 (10MoritzMuehlenhoff) Your patch is also missing in the Ganeti version in stretch, let's report it to the Debian BTS so that it can possibly be backported to a stretch point release? [11:27:14] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 54 ESP OK [11:27:14] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 54 ESP OK [11:27:14] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 54 ESP OK [11:27:14] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 54 ESP OK [11:28:17] 10Operations, 10Services, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051#3747313 (10dcausse) Typically logstash/elastic is not able to sustain these kind of events: https://logstash.wikim... [11:28:55] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [11:30:45] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 112 ESP OK [11:37:41] <_joe_> !log cleaning up spurious directories /var/lib/puppet/server/ssl/ca from eqiad's puppetmaster backends, generated due to some error on 8/11/2017 [11:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:09] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099#3747368 (10Joe) In all this, some random person revoked puppetmaster1001's own certificate, which is used to access the ca_server, as far as I... 
[11:45:11] <_joe_> !log removed all local hacks from puppetmaster1001, now it uses rhodium again [11:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:30] !log rebooting mw1180-mw1188 (app servers) to 4.9.5 (also to pick up new OpenSSL) [11:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:00] RECOVERY - puppet last run on puppetcompiler1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:10:52] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3747466 (10Ladsgroup) This is not my decision to make, our PM is not around, I'll ask her when she's back [12:21:46] (03PS3) 10ArielGlenn: clean up dumps web server rsync to its fallback [puppet] - 10https://gerrit.wikimedia.org/r/390165 (https://phabricator.wikimedia.org/T179942) [12:31:14] (03PS4) 10ArielGlenn: clean up dumps web server rsync to its fallback [puppet] - 10https://gerrit.wikimedia.org/r/390165 (https://phabricator.wikimedia.org/T179942) [12:34:13] (03CR) 10Elukey: [C: 031] "I was unable to test this change on rdb1002 since pcc failed querying puppetdb (https://puppet-compiler.wmflabs.org/compiler02/8700/rdb100" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [12:35:52] 10Operations: Reboot of dumps hosts - https://phabricator.wikimedia.org/T180127#3747569 (10MoritzMuehlenhoff) [12:37:34] 10Operations, 10Datasets-General-or-Unknown, 10User-ArielGlenn: Reboot of dumps hosts - https://phabricator.wikimedia.org/T180127#3747584 (10ArielGlenn) [12:37:53] !log ladsgroup@terbium:/srv/mediawiki-staging/php-1.31.0-wmf.6$ mwscript extensions/ORES/maintenance/CheckModelVersions.php --wiki=frwiki (T180115) [12:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:01] 
T180115: [regression] ORES filters are not available on French Wikipedia naymore - https://phabricator.wikimedia.org/T180115 [12:38:18] !log rebooting mw1189-mw1208 (API servers) to 4.9.5 (also to pick up new OpenSSL) [12:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:47] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3747609 (10elukey) ping @Cmjohnson :) [12:54:58] 10Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#3747672 (10akosiaris) Yes there hasn't been any release since the time of that patch, so let's do that. Filed it in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=881255 [13:04:20] !log rebooting mw1209-mw1220 (app servers) to 4.9.51 (also to pick up new OpenSSL) [13:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:04] (03PS1) 10Elukey: hadoop: raise jvm heap sizes for HDFS datanode and Yarn daemons [puppet] - 10https://gerrit.wikimedia.org/r/390237 (https://phabricator.wikimedia.org/T178876) [13:08:42] 10Operations, 10Performance-Team, 10Patch-For-Review: setup/install lawrencium for temp use by performance team - https://phabricator.wikimedia.org/T179968#3747695 (10Gilles) 05Open>03Resolved It does work, thank you. 
[13:11:43] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "This is what this patch does:" [puppet] - 10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [13:16:25] (03PS5) 10ArielGlenn: clean up dumps web server rsync to its fallback [puppet] - 10https://gerrit.wikimedia.org/r/390165 (https://phabricator.wikimedia.org/T179942) [13:18:28] (03PS6) 10ArielGlenn: clean up dumps web server rsync to its fallback [puppet] - 10https://gerrit.wikimedia.org/r/390165 (https://phabricator.wikimedia.org/T179942) [13:20:38] (03PS7) 10ArielGlenn: clean up dumps web server rsync to its fallback [puppet] - 10https://gerrit.wikimedia.org/r/390165 (https://phabricator.wikimedia.org/T179942) [13:21:30] (03CR) 10ArielGlenn: [C: 032] clean up dumps web server rsync to its fallback [puppet] - 10https://gerrit.wikimedia.org/r/390165 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [13:55:20] (03PS1) 10Ladsgroup: Use a threshold that ores in frwiki can stand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390239 (https://phabricator.wikimedia.org/T180115) [13:56:45] That'd be great if this can go into SWAT [13:58:57] Added to the calendar, hope the bot picks it up [13:59:12] halfak|Mobile: The patch that fixes it is going to be deployed soon [14:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171109T1400). [14:00:08] kart_ and Pchelolo: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:17] * kart_ is here. [14:00:27] I added a patch too [14:00:29] anybody wants to swat? once, twice... 
[14:00:38] (03PS1) 10Dzahn: passwords: update labs root key for Daniel [labs/private] - 10https://gerrit.wikimedia.org/r/390240 [14:00:41] (I can swat, by the way) [14:01:02] cool [14:01:17] I can take of my own patch when the SWAT is done [14:01:29] Amir1: deal! :) [14:01:29] zeljkof: hehe seems like in the last week I've been keeping you very busy with SWATs [14:01:40] ok, for the record... [14:01:44] I can SWAT today! [14:01:49] (03CR) 10Dzahn: [C: 04-1] passwords: update labs root key for Daniel [labs/private] - 10https://gerrit.wikimedia.org/r/390240 (owner: 10Dzahn) [14:01:54] Pchelolo: it's literally my job! ;) [14:02:11] job security :) [14:02:37] I am actually asking almost every time if people want to deploy their own patches ;P [14:02:57] 10Operations, 10Analytics, 10DBA, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3747806 (10Ottomata) > Would there be a separate webrequest hadoop access and EL-only Hadoop access? It doesn't work 100% as it should, but for Hadoop access... [14:03:12] kart_, Pchelolo: does your patch take a long time to test? [14:03:44] zeljkof: my's non-testable. Logging again, we just can't get it right with these logstash type conflicts.. [14:03:52] zeljkof: I've kept article ready to test, should take 3-4 minutes to confirm. [14:04:04] (03CR) 10Joal: [C: 031] "LGTM !" 
[puppet] - 10https://gerrit.wikimedia.org/r/390237 (https://phabricator.wikimedia.org/T178876) (owner: 10Elukey) [14:04:37] (03CR) 10Ottomata: [C: 031] hadoop: raise jvm heap sizes for HDFS datanode and Yarn daemons [puppet] - 10https://gerrit.wikimedia.org/r/390237 (https://phabricator.wikimedia.org/T178876) (owner: 10Elukey) [14:04:59] ok, staring with kart_'s patch then [14:05:11] kart_, Pchelolo: let me know if you want to deploy yourself [14:06:22] zeljkof: no :) [14:06:45] (03PS2) 10Dzahn: passwords: update labs root key for Daniel [labs/private] - 10https://gerrit.wikimedia.org/r/390240 [14:06:48] kart_: you should try it once, it's fun ;) [14:07:32] Pchelolo: I can merge both your changes and deploy together? [14:07:50] (03PS1) 10Elukey: role::druid::analytics::worker: allow Hadoop worker nodes to contact zk [puppet] - 10https://gerrit.wikimedia.org/r/390242 [14:07:52] yup zeljkof [14:08:07] Pchelolo: ok, doing that so we don't have to wait for CI [14:08:28] zeljkof: next time for sure. [14:08:40] one depend on another though, 390211 goes first [14:08:51] (03CR) 10Joal: [C: 031] "+1 ! 
Thanks a lot elukey" [puppet] - 10https://gerrit.wikimedia.org/r/390242 (owner: 10Elukey) [14:08:56] (merges first) [14:09:01] Pchelolo: if I merge them both, they are deployed together [14:09:09] oh yes, will merge in the order in calendar [14:09:14] apergos: /rsync_from_webserver.sh: missing argument --desthost [14:09:21] (03CR) 10Ottomata: [C: 031] role::druid::analytics::worker: allow Hadoop worker nodes to contact zk [puppet] - 10https://gerrit.wikimedia.org/r/390242 (owner: 10Elukey) [14:09:58] !log Decommissioning Cassandra, restbase2004-b.codfw.wmnet (T179422) [14:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:06] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [14:10:14] mutante_: thanks, will check it out [14:10:22] apergos: :) [14:11:49] 10Operations, 10Services, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051#3747867 (10Pchelolo) In general there are several issues we've observed: Mapping conflicts between different serv... 
[14:12:09] (03CR) 10Elukey: [C: 032] role::druid::analytics::worker: allow Hadoop worker nodes to contact zk [puppet] - 10https://gerrit.wikimedia.org/r/390242 (owner: 10Elukey) [14:12:55] (03PS1) 10Muehlenhoff: Add shell user for phedenskog [puppet] - 10https://gerrit.wikimedia.org/r/390244 (https://phabricator.wikimedia.org/T179729) [14:12:57] (03PS1) 10Muehlenhoff: Add phedenskog to perf-team group [puppet] - 10https://gerrit.wikimedia.org/r/390245 (https://phabricator.wikimedia.org/T179729) [14:14:35] (03PS2) 10Hashar: contint: migrate castor to a profile [puppet] - 10https://gerrit.wikimedia.org/r/386812 [14:15:08] (03CR) 10Dzahn: [C: 031] Add phedenskog to perf-team group [puppet] - 10https://gerrit.wikimedia.org/r/390245 (https://phabricator.wikimedia.org/T179729) (owner: 10Muehlenhoff) [14:15:46] (03PS1) 10ArielGlenn: fix up dumpsdata rsync argument [puppet] - 10https://gerrit.wikimedia.org/r/390246 [14:16:08] (03CR) 10jerkins-bot: [V: 04-1] fix up dumpsdata rsync argument [puppet] - 10https://gerrit.wikimedia.org/r/390246 (owner: 10ArielGlenn) [14:16:31] (03CR) 10Hashar: [V: 031 C: 031] "Labs only. I cherry picked it on the CI puppet master and ran puppet on the sole instance using it ( castor02.integration.eqiad.wmflabs )" [puppet] - 10https://gerrit.wikimedia.org/r/386812 (owner: 10Hashar) [14:18:52] kart_: the patch is at mwdebug1002 [14:19:14] cool. Testing. 
[14:20:36] (03PS2) 10Hashar: contint: migrate publisher to a profile [puppet] - 10https://gerrit.wikimedia.org/r/386813 [14:20:43] kart_: some problems [14:20:55] ah [14:20:58] tail -n 1000 /srv/mw-log/hhvm.log [14:21:18] 42 Notice: Undefined variable: translation in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ContentTranslation/api/ApiQueryContentTranslation.php on line 126 [14:21:24] (03PS2) 10ArielGlenn: fix up dumpsdata rsync argument [puppet] - 10https://gerrit.wikimedia.org/r/390246 [14:21:37] 41 Notice: Undefined property: ApiQueryContentTranslation::$user in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ContentTranslation/api/ApiQueryContentTranslation.php on line 120 [14:21:49] 22 Notice: Undefined variable: translation in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ContentTranslation/api/ApiQueryContentTranslation.php on line 134 [14:22:47] wmf.6? [14:23:22] hm, just notice that [14:24:04] AFAIK, that's fixed, but let me confirm. [14:25:16] !log zfilipin@tin Synchronized php-1.31.0-wmf.7/extensions/EventBus/EventBus.php: SWAT: [[gerrit:390211|Logging improvements]] [[gerrit:390212|Rename logged field to fix logstash mapping]] (duration: 00m 54s) [14:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:35] (03CR) 10Hashar: [V: 031 C: 031] "Labs only. The sole instance having that role is integration-publishing.integration.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/386813 (owner: 10Hashar) [14:26:14] (03Abandoned) 10Hashar: Apply jenkins agent username from hiera [puppet] - 10https://gerrit.wikimedia.org/r/379729 (owner: 10Hashar) [14:26:30] Pchelolo: deployed your commits, please monitor logs and thanks for deploying with #releng ;) [14:26:34] zeljkof: got time for one more commit? [14:26:45] thank you zeljkof [14:27:09] ori: sure, we are at the half of the window [14:27:20] zeljkof: OK. That's fixed. [14:27:26] ori: please add it to the calendar [14:27:30] zeljkof: so let me test my patch again. 
[14:27:31] all good zeljkof [14:27:34] doing so [14:27:47] the effect I was hoping for didn't make me wait for it for too long [14:28:12] Can I get my patch in? ORES is disabled in frwiki because of it [14:28:59] Amir1: it's a config change, right? I think it's safe to deploy it while kart_ is testing his commit in ContentTranslation [14:29:00] zeljkof: go ahead with deployment. [14:29:08] kart_: deploying [14:29:25] zeljkof: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1775089&oldid=1775086 [14:29:26] I thought it's done sorry [14:29:35] https://gerrit.wikimedia.org/r/#/c/390073/ is the patch [14:29:53] Amir1: in a few seconds, will let you know [14:30:03] Thanks [14:30:17] Amir1: you should have said it's urgent, I have forgot to ask, you could have deployed first [14:30:24] 10Operations, 10Traffic, 10netops: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748043 (10BBlack) I haven't had time to analyze it deeply/manually, but I managed to capture/filter down tcpdump verbose/stamped outputs for exactly one... [14:30:24] !log zfilipin@tin Synchronized php-1.31.0-wmf.7/extensions/ContentTranslation/modules/: SWAT: [[gerrit:390206|Bring back the overlay support for a specific screen region (T179997)]] (duration: 00m 50s) [14:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:30] T179997: Saved translation fails to load - https://phabricator.wikimedia.org/T179997 [14:30:31] kart_: deployed, please check [14:30:43] Amir1: your turn, let me know when I can continue [14:30:49] thanks [14:30:52] ori: looking... 
[14:30:56] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390239 (https://phabricator.wikimedia.org/T180115) (owner: 10Ladsgroup) [14:31:06] (03PS11) 10Filippo Giunchedi: prometheus: add redis_exporter class and profile [puppet] - 10https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637) [14:31:08] kart_: oh forgot to say, thanks for deploying with #releng ;) [14:31:40] :) [14:31:47] Where is my sticker? [14:32:18] kart_: sticker? [14:32:33] (03PS3) 10ArielGlenn: fix up dumpsdata rsync argument [puppet] - 10https://gerrit.wikimedia.org/r/390246 [14:32:44] (03Merged) 10jenkins-bot: Use a threshold that ores in frwiki can stand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390239 (https://phabricator.wikimedia.org/T180115) (owner: 10Ladsgroup) [14:32:57] (03CR) 10jenkins-bot: Use a threshold that ores in frwiki can stand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390239 (https://phabricator.wikimedia.org/T180115) (owner: 10Ladsgroup) [14:34:07] (03CR) 10ArielGlenn: [C: 032] fix up dumpsdata rsync argument [puppet] - 10https://gerrit.wikimedia.org/r/390246 (owner: 10ArielGlenn) [14:34:34] (03CR) 10Filippo Giunchedi: prometheus: add redis_exporter class and profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [14:34:37] ori: is your patch testable at mwdebug1002? 
[14:34:44] (it's not there yet, just asking) [14:35:04] not entirely, but it would be nice to verify nothing blows up, so if you could get it there, that would be useful for me [14:35:24] ori: sure, will get it there first before full deployment [14:35:47] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Use a threshold that ores in frwiki can stand (T180115) (duration: 00m 50s) [14:35:48] (waiting for Amir1 to finish with his deploy) [14:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:54] T180115: [regression] ORES filters are not available on French Wikipedia anymore - https://phabricator.wikimedia.org/T180115 [14:36:02] It's done now [14:36:19] Amir1: ok, taking over [14:36:34] ori: merging your change, will let you know when it's at mwdebug [14:36:39] thank you [14:36:43] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390073 (owner: 10Ori.livneh) [14:37:27] (03CR) 10Gilles: [C: 031] webperf: Refactor tests to directly associate expected data with cases [puppet] - 10https://gerrit.wikimedia.org/r/390083 (owner: 10Krinkle) [14:37:51] (03Merged) 10jenkins-bot: xenon: encode the request method as a virtual stack frame [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390073 (owner: 10Ori.livneh) [14:38:57] ori: it's at mwdebug1002 [14:39:04] let me know if I can deploy [14:39:04] verifying [14:39:26] Hi, I have 5 users reporting that watchlists (and loading pages) are slow to access on French Wikipedia. Any known cause? [14:39:32] zeljkof: LGTM, please go ahead [14:39:38] ori: deploying... 
[14:39:39] ref: https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Le_Bistro/9_novembre_2017#Temps_d.27acc.C3.A8s_aux_donn.C3.A9es [14:39:57] (03PS3) 10Dzahn: contint: migrate castor to a profile [puppet] - 10https://gerrit.wikimedia.org/r/386812 (owner: 10Hashar) [14:40:06] (03CR) 10Gilles: [C: 031] webperf: Record navtiming discards to Graphite, and add is_sane test [puppet] - 10https://gerrit.wikimedia.org/r/390061 (owner: 10Krinkle) [14:40:10] (03CR) 10jenkins-bot: xenon: encode the request method as a virtual stack frame [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390073 (owner: 10Ori.livneh) [14:40:28] Amir1: did you say there was something with frwiki? cc Trizek [14:40:36] !log zfilipin@tin Synchronized wmf-config/StartProfiler.php: SWAT: [[gerrit:390073|xenon: encode the request method as a virtual stack frame]] (duration: 00m 50s) [14:40:41] PROBLEM - Disk space on furud is CRITICAL: DISK CRITICAL - free space: /mnt/2a 1248862 MB (3% inode=96%) [14:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:42] He reported it ::D [14:40:45] zeljkof: ORES is down on fr as well. [14:40:51] PROBLEM - Disk space on flerovium is CRITICAL: DISK CRITICAL - free space: /mnt/2a 1248881 MB (3% inode=96%) [14:41:07] Is there a relation between the two cases? [14:41:09] Trizek: not anymore, You need to wait a little [14:41:09] ori: deployed, please check and thanks for deploying with #releng [14:41:15] this patch fixes it [14:41:21] Good news Amir1! Thak you ! [14:41:25] +n [14:41:53] Trizek: Look at the last chart in here: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1&from=now-1h&to=now [14:41:57] why does ORES being down make wikipedia-fr slow? [14:41:58] 20 minutes left, anybody else has patches for EU SWAT? 
[14:42:03] zeljkof: thanks again [14:42:10] the failed ones went to zero [14:42:17] ori: frwiki is not slow AFAIK [14:42:32] Hi, I have 5 users reporting that watchlists (and loading pages) are slow to access on French Wikipedia. Any known cause? [14:42:39] ores itself it's down either too, the config wasn't compatible [14:42:48] !log EU SWAT finished [14:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:06] Trizek: ori I don't know about that part [14:43:52] Trizek: ORES is back there, also it's fast for me [14:44:49] Trizek: I would suggest the following: try disabling JavaScript and see if that makes a big difference. If so, check to see if any gadgets have recently been enabled for all users, or if there are any suspicious changes to MediaWiki:Common.js [14:44:59] if not, report it on Phabricator to the performance team [14:45:09] Will do, thanks ori [14:45:20] gilles: fyi ^ [14:47:41] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This is rather ill-conceived. This should not be solved at this level but rather internally in ferm. Aside from that this creates spurious" [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314) (owner: 10Hashar) [14:48:30] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:49:28] \o/ [14:49:41] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:52:19] zeljkof: eu swat finished? [14:52:32] addshore: yes [14:52:40] I might tack a tiny bit onto the end! 
:D [14:52:41] you still have the time for a quick deploy ;) [14:53:46] (03PS2) 10Addshore: Add AdvancedSearch to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379714 [14:53:56] (03PS2) 10Addshore: Enable AdvancedSearch on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379715 [14:54:03] #speedy [14:54:38] (03CR) 10Addshore: [C: 032] Add AdvancedSearch to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379714 (owner: 10Addshore) [14:54:44] zeljkof: labs / beta only anyway! [14:55:48] (03Merged) 10jenkins-bot: Add AdvancedSearch to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379714 (owner: 10Addshore) [14:56:06] (03CR) 10Dzahn: [C: 032] "per "labs-only and already cherry-picked"" [puppet] - 10https://gerrit.wikimedia.org/r/386812 (owner: 10Hashar) [14:57:06] (03CR) 10jenkins-bot: Add AdvancedSearch to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379714 (owner: 10Addshore) [14:57:44] (03CR) 10Addshore: [C: 032] Enable AdvancedSearch on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379715 (owner: 10Addshore) [14:57:53] !log addshore@tin Synchronized wmf-config/extension-list-labs: [[gerrit:379714|Add AdvancedSearch to extension-list-labs]] LABS / BETA ONLY (duration: 00m 50s) [14:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:49] (03PS12) 10Filippo Giunchedi: prometheus: add redis_exporter class and profile [puppet] - 10https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637) [15:00:36] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add redis_exporter class and profile [puppet] - 10https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [15:00:59] (03PS3) 10Addshore: Enable AdvancedSearch on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379715 [15:00:59] (03CR) 10Addshore: [C: 032] Enable 
AdvancedSearch on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379715 (owner: 10Addshore) [15:02:04] (03Merged) 10jenkins-bot: Enable AdvancedSearch on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379715 (owner: 10Addshore) [15:03:21] (03PS1) 10Filippo Giunchedi: prometheus: fix notify prometheus-redis-exporter [puppet] - 10https://gerrit.wikimedia.org/r/390255 (https://phabricator.wikimedia.org/T148637) [15:03:56] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: fix notify prometheus-redis-exporter [puppet] - 10https://gerrit.wikimedia.org/r/390255 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [15:04:02] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: [[gerrit:379715|Add AdvancedSearch to extension-list-labs]] LABS / BETA ONLY PT1/2 (duration: 00m 50s) [15:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:33] !log last sync was actually "Enable AdvancedSearch on beta" [15:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:34] !log addshore@tin Synchronized wmf-config/CommonSettings-labs.php: [[gerrit:379715|Enable AdvancedSearch on beta]] LABS / BETA ONLY PT2/2 (duration: 00m 49s) [15:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:44] done! [15:06:10] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:08:01] (03CR) 10BryanDavis: passwords: add labs key for arturo (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/390027 (owner: 10Arturo Borrero Gonzalez) [15:08:21] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051#3748141 (10Pchelolo) [15:10:17] (03PS1) 10Ema: WIP: varnish: log slow requests [puppet] - 10https://gerrit.wikimedia.org/r/390258 [15:11:10] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:12:46] !log Creating mathoid schema (T179419) [15:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:53] T179419: Migrate mathoid storage from legacy to new strategy - https://phabricator.wikimedia.org/T179419 [15:16:11] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051#3748176 (10dcausse) My fear is that the "too many fields" problem is going to be more painful than the... 
[15:16:57] (03CR) 10jenkins-bot: Enable AdvancedSearch on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379715 (owner: 10Addshore) [15:19:21] (03PS1) 10Filippo Giunchedi: prometheus: have multi-instance redis-exporter running, stop default one [puppet] - 10https://gerrit.wikimedia.org/r/390260 (https://phabricator.wikimedia.org/T148637) [15:20:32] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: have multi-instance redis-exporter running, stop default one [puppet] - 10https://gerrit.wikimedia.org/r/390260 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [15:25:31] PROBLEM - Disk space on lawrencium is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/3923d403d27ed29422009d97e80fb4d69c598ee19999d161939be850dbbb808d/merged is not accessible: Permission denied [15:26:40] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:11] rdb is me, fixing [15:27:31] RECOVERY - Disk space on lawrencium is OK: DISK OK [15:28:10] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:28:44] (03PS8) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [15:29:11] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:29:34] (03PS1) 10Filippo Giunchedi: prometheus: switch to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/390262 (https://phabricator.wikimedia.org/T148637) [15:29:37] (03PS2) 10Arturo Borrero Gonzalez: passwords: add labs key for arturo [labs/private] - 10https://gerrit.wikimedia.org/r/390027 [15:31:07] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: switch to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/390262 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [15:33:10] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:35:37] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: setup/deploy dns400[12]/wmf721[56] - https://phabricator.wikimedia.org/T179204#3748216 (10RobH) [15:36:31] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:36:40] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:36:41] PROBLEM - puppet last run on rdb2006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:37:05] (03CR) 10BryanDavis: [C: 031] passwords: add labs key for arturo [labs/private] - 10https://gerrit.wikimedia.org/r/390027 (owner: 10Arturo Borrero Gonzalez) [15:38:26] <_joe_> I am not sure that's the place you should add your ssh keys to :P [15:41:41] RECOVERY - puppet last run on rdb2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:47:04] (03PS1) 10Filippo Giunchedi: prometheus: pass -redis.addr to redis-exporter [puppet] - 10https://gerrit.wikimedia.org/r/390263 (https://phabricator.wikimedia.org/T148637) [15:47:44] (03PS1) 10Alexandros Kosiaris: prometheus: Force using read-only kubelet API [puppet] - 10https://gerrit.wikimedia.org/r/390264 (https://phabricator.wikimedia.org/T177395) [15:48:15] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: pass -redis.addr to redis-exporter [puppet] - 10https://gerrit.wikimedia.org/r/390263 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [15:51:20] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:51:45] <_joe_> akosiaris: is that you? ^^ [15:53:36] yes [15:55:04] (03PS1) 10Filippo Giunchedi: prometheus: fix port vs instance [puppet] - 10https://gerrit.wikimedia.org/r/390266 [15:55:37] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: fix port vs instance [puppet] - 10https://gerrit.wikimedia.org/r/390266 (owner: 10Filippo Giunchedi) [15:57:11] 10Operations, 10Traffic, 10netops: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748250 (10BBlack) Annotating some basic thoughts on the above (keep in mind with various kinds of offload in play, packetization/MTU/checksum will often... 
[15:57:13] (03PS2) 10Alexandros Kosiaris: prometheus: Force using read-only kubelet API [puppet] - 10https://gerrit.wikimedia.org/r/390264 (https://phabricator.wikimedia.org/T177395) [15:57:15] (03PS1) 10Alexandros Kosiaris: Prometheus: add kubernetes node cadvisor job [puppet] - 10https://gerrit.wikimedia.org/r/390267 [15:59:20] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:01:31] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:03:19] 10Operations, 10ops-ulsfo, 10Traffic: setup/deploy dns400[12]/wmf721[56] - https://phabricator.wikimedia.org/T179204#3716775 (10RobH) These are idling as role spare with the OS installed, ready for service. [16:06:30] (03PS1) 10ArielGlenn: add missing space in dump webserver rsync exclude args [puppet] - 10https://gerrit.wikimedia.org/r/390269 [16:07:26] (03PS2) 10ArielGlenn: add missing space in dump webserver rsync exclude args [puppet] - 10https://gerrit.wikimedia.org/r/390269 [16:08:00] (03CR) 10ArielGlenn: [C: 032] add missing space in dump webserver rsync exclude args [puppet] - 10https://gerrit.wikimedia.org/r/390269 (owner: 10ArielGlenn) [16:22:31] 10Puppet, 10Cloud-VPS: role::puppetmaster::standalone has no firewall rule for port 8140 - https://phabricator.wikimedia.org/T154150#3748333 (10aborrero) a:03aborrero [16:33:17] !log Restarting Cassandra, restbase2005-a.codfw.wmnet (T179419) [16:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:24] T179419: Migrate mathoid storage from legacy to new strategy - https://phabricator.wikimedia.org/T179419 [16:34:38] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational [16:36:29] PROBLEM - cassandra-a SSL 10.192.48.46:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:36:38] PROBLEM - cassandra-a CQL 
10.192.48.46:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.46 and port 9042: Connection refused [16:37:29] RECOVERY - cassandra-a SSL 10.192.48.46:7001 on restbase2005 is OK: SSL OK - Certificate restbase2005-a valid until 2018-08-17 16:11:58 +0000 (expires in 280 days) [16:37:38] RECOVERY - cassandra-a CQL 10.192.48.46:9042 on restbase2005 is OK: TCP OK - 0.036 second response time on 10.192.48.46 port 9042 [16:45:25] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Possibly faulty BBU on analytics1029 - https://phabricator.wikimedia.org/T178742#3748365 (10RobH) So the determination of ordering new hardware for failed will have to also rely on budgeting and if analytics can run without this host or require it. @e... [16:45:43] (03PS2) 10Cmjohnson: Removing dns entry for decom server wmf3248 [dns] - 10https://gerrit.wikimedia.org/r/390065 [16:46:05] (03CR) 10Cmjohnson: [C: 032] Removing dns entry for decom server wmf3248 [dns] - 10https://gerrit.wikimedia.org/r/390065 (owner: 10Cmjohnson) [16:48:57] (03PS3) 10Reedy: Add hifwiktionary too labsdb.yaml [puppet] - 10https://gerrit.wikimedia.org/r/389555 (https://phabricator.wikimedia.org/T173643) [17:00:05] godog, moritzm, and _joe_: Dear deployers, time to do the Puppet SWAT(Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171109T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. 
[17:00:05] (03PS5) 10Giuseppe Lavagetto: puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) (owner: 10Herron) [17:08:22] 10Operations, 10Traffic, 10netops: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748444 (10ayounsi) Another data point, https://grafana.wikimedia.org/dashboard/db/network-performances-global?orgId=1&from=1507633459013&to=1507680000000... [17:14:16] !log Restarting Cassandra, restbase2005-b.codfw.wmnet (T179419) [17:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:23] T179419: Migrate mathoid storage from legacy to new strategy - https://phabricator.wikimedia.org/T179419 [17:17:10] (03CR) 10Filippo Giunchedi: "Only nits, LGTM. I'll update https://gerrit.wikimedia.org/r/#/c/389930/" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/390264 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [17:17:23] PROBLEM - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.47 and port 9042: Connection refused [17:18:02] PROBLEM - cassandra-b SSL 10.192.48.47:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:18:57] (03PS4) 10Filippo Giunchedi: profile: allow Prometheus to access k8s kubelet [puppet] - 10https://gerrit.wikimedia.org/r/389930 (https://phabricator.wikimedia.org/T177395) [17:19:02] RECOVERY - cassandra-b SSL 10.192.48.47:7001 on restbase2005 is OK: SSL OK - Certificate restbase2005-b valid until 2018-08-17 16:11:59 +0000 (expires in 280 days) [17:19:23] RECOVERY - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is OK: TCP OK - 0.036 second response time on 10.192.48.47 port 9042 [17:23:51] (03CR) 10Filippo Giunchedi: [WIP] First commit (039 comments) [software/druid_exporter] - 
10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [17:24:10] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Possibly faulty BBU on analytics1029 - https://phabricator.wikimedia.org/T178742#3748503 (10RobH) Discussed some, Chris is going to pull a BBU out of a decom system to replace the defective one. Also analytics may have to start planning for the replac... [17:29:31] (03CR) 10Madhuvishy: [V: 032 C: 032] passwords: add labs key for arturo [labs/private] - 10https://gerrit.wikimedia.org/r/390027 (owner: 10Arturo Borrero Gonzalez) [17:33:33] (03PS7) 10Zoranzoki21: Enable the ArticlePlaceholder for Northern Sami (sewiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) [17:36:09] (03CR) 10Herron: puppet: conditionally pin packages to appropriate repo for puppet 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) (owner: 10Herron) [17:38:08] I have a question about direct access to files uploaded to commons and eqiad cluster if anyone thinks they might be able to answer [17:39:09] 10Operations, 10Traffic, 10netops: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748534 (10BBlack) I'm pretty sure all of the TCP application-level data flows match up roughly with the expected sequence of TLS HANDSHAKE -> CLIENT HTTP... [17:41:06] bearloga: What sort of question? 
[17:42:24] (03PS3) 10Alexandros Kosiaris: prometheus: Force using read-only kubelet API [puppet] - 10https://gerrit.wikimedia.org/r/390264 (https://phabricator.wikimedia.org/T177395) [17:42:26] (03PS2) 10Alexandros Kosiaris: Prometheus: add kubernetes node cadvisor job [puppet] - 10https://gerrit.wikimedia.org/r/390267 [17:44:07] Reedy: if I wanted to grab a copy of some commons images directly from eqiad over sftp for a project instead of through the mediawiki interface over http, is that possible and if it is, would it even make sense to do that? [17:44:09] (03PS6) 10Herron: puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) [17:44:34] bearloga: You'd need to get them from swift, they're not directly accessible on disk [17:44:54] Reedy: ebernhardson had this to say in #wikimedia-discovery: " hmm, so media storage is all in a distributed filesystem called swift, iiuc. Pulling those images over http is probably not that big of a difference, especially since the varnish layer caches living above swift are probably excessively more performant than swift itself. so, imo, grabbing 10k images (or whatever) from http is probably just fine. On the other hand [17:44:54] if you want 25M images, we might have to think more" [17:45:15] Sounds vaguely right [17:45:26] (03CR) 10Alexandros Kosiaris: prometheus: Force using read-only kubelet API (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/390264 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [17:45:27] You might be better opening a phab task, tagging operations and godog [17:45:37] Reedy: noted, thanks! 
[17:45:39] Depends what you want to do [17:45:50] You might be able to write a slim(er) SWIFT wrapper than MediaWiki [17:45:57] Well, you almost certainly can [17:46:01] It's whether it's worth the effort ;) [17:46:52] (03PS4) 10Alexandros Kosiaris: prometheus: Force using read-only kubelet API [puppet] - 10https://gerrit.wikimedia.org/r/390264 (https://phabricator.wikimedia.org/T177395) [17:46:54] (03PS3) 10Alexandros Kosiaris: Prometheus: add kubernetes node cadvisor job [puppet] - 10https://gerrit.wikimedia.org/r/390267 [17:47:52] bearloga: There's also ways of pulling avoiding the caches, to save "polluting" them etc [17:48:06] And/or checking to see if it's gonna be a HIT/MISS first, and then choose appropriately [17:49:38] Reedy: but wouldn't I want to download a cached image to not put unnecessary stress on the application layer? [17:50:04] Sure [17:50:12] My point being, if it's already in the cache, get it from there [17:50:17] If it's not... [17:50:22] Aahhhh [17:50:49] Reedy: sorry, I think I misread your original message on account of I still haven't had coffee [17:50:55] heh [17:51:13] I presume you want originals, not thumbs too? [17:53:53] Reedy: not necessarily original, but definitely bigger than thumb. I might be okay with a lower resolution version like a preview. [17:55:41] "thumbs" in mw generally refers to any size other than the original [17:56:06] bearloga: yeah what Reedy said re: phab, easier to keep track of. tl;dr though is that yes you can fetch images from inside the cluster directly from swift the same way varnish does [17:56:51] bawolff: ah! didn't know that! thanks [17:57:01] 10Operations, 10Analytics, 10DBA, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3748567 (10Nuria) @Tgr : i think your use case would work in hadoop as edit data related information is available since the beginning of time, now, it is true... 
[17:58:20] (03PS2) 10Ema: varnish: log slow requests [puppet] - 10https://gerrit.wikimedia.org/r/390258 [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171109T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:02:22] 10Operations, 10Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#3748576 (10kaythaney) thanks! i'll update that first thing next week. (please feel free to file tickets here and tag me, or ping me directly.) [18:04:07] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler02/8712/" [puppet] - 10https://gerrit.wikimedia.org/r/390258 (owner: 10Ema) [18:04:28] !log rolling restart of parsoid servers in codfw for 4.9.51 kernel update [18:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:39] 10Operations, 10CirrusSearch, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): Created dedicated elastic component in our APT repository - https://phabricator.wikimedia.org/T179964#3748577 (10debt) [18:05:30] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: Force using read-only kubelet API [puppet] - 10https://gerrit.wikimedia.org/r/390264 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [18:05:56] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, modulo rebase" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/390267 (owner: 10Alexandros Kosiaris) [18:07:40] moritzm: should I give you some time before I deploy? [18:12:14] arlolra: oh, I can hold this until you're done. 
jouncebot said "No GERRIT patches in the queue for this window AFAICS", so I thought this didn't happen [18:12:24] so, please go ahead [18:13:06] 10Operations, 10Discovery-Search: search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#3748595 (10debt) p:05Triage>03Low Let's take a look and fix this, then figure out who uses it and why. [18:13:13] herron: when you get a minute could you take another look at https://gerrit.wikimedia.org/r/#/c/388478/ and https://gerrit.wikimedia.org/r/#/c/388032/ ? thanks! [18:13:37] * ebernhardson looks for why search.wikimedia.org exists. Find it first mentioned in "Initial commit of public puppet repo". not that much help :P [18:13:53] ebernhardson: I believe, the answer is apple [18:13:58] godog sure! [18:14:15] ebernhardson: https://wikitech.wikimedia.org/wiki/Search.wikimedia.org [18:14:20] Reedy: do we think it matters that it's issuing 500s? [18:14:23] moritzm: we generally don't queue up patches. there's a link to mw:Parsoid/Deployments where fresh deploys are waiting. sorry about that. [18:14:30] ebernhardson: Probably [18:14:41] Yeah [18:14:49] !log not rebooting parsoid hosts due to Services deployment window, instead rolling restart of mw2120-mw2139 for kernel update to 4.9.51 [18:14:51] search.wm.o is behind the text varnishes [18:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:53] Do we know why? [18:14:55] https://github.com/wikimedia/operations-mediawiki-config/blob/master/docroot/search.wikimedia.org/index.php [18:14:58] thus accounting for the 500s of those [18:15:09] "Wikimedia search service internal error. Unexpected result format." [18:15:17] which is at least adding noise to that data [18:15:56] arlolra: ok, good to know. 
please go ahead, I'll proceed with other servers for now [18:16:08] i can look into it and probably fix it, i can't imagine anything too crazy with a 100 line script [18:16:19] some missing/extra param passed to the api search [18:16:25] moritzm: ty [18:16:28] !log arlolra@tin Started deploy [parsoid/deploy@d1c7386]: Updating Parsoid to 2887b5ad [18:16:31] basically, a curl to https://$lang.$site.org/w/api.php?action=opensearch&search=$urlSearch&limit=$limit [18:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:52] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_sysctl] [18:28:48] !log arlolra@tin Finished deploy [parsoid/deploy@d1c7386]: Updating Parsoid to 2887b5ad (duration: 12m 20s) [18:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:17] (03CR) 10Herron: [C: 031] mx: export metrics from exim4 mainlog [puppet] - 10https://gerrit.wikimedia.org/r/388032 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [18:38:12] PROBLEM - Check systemd state on restbase2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:41:12] RECOVERY - Check systemd state on restbase2005 is OK: OK - running: The system is fully operational [18:41:16] !log Updated Parsoid to 2887b5ad (T178253, T173643, T176728, T180010, T171381, T179757) [18:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:29] T179757: Linter/tidy-font-bug triggers on empty font tags - https://phabricator.wikimedia.org/T179757 [18:41:29] T180010: Parsoid creates broken wikitext for link inside square brackets - https://phabricator.wikimedia.org/T180010 [18:41:29] T171381: Fix missing-end-tag linter issue generation - https://phabricator.wikimedia.org/T171381 [18:41:29] T173643: Create Wiktionary Fiji Hindi - https://phabricator.wikimedia.org/T173643 [18:41:29] T178253: Figure handler rejects nested tables in figure captions - https://phabricator.wikimedia.org/T178253 [18:50:28] ebernhardson, Reedy: Yes, tldr is apple search gateway. FWIW, I'm pretty sure recent versions of OSX don't actually call that anymore, they use api.php directly [18:50:36] But there's weird history there [18:50:52] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:51:05] I have wasted too many hours thinking about that stupid fucking thing [18:53:10] If its not used anymore lets just get rid of so we dont have to even think about it [18:53:18] We can't. [18:53:31] I said recent versions. Some people don't upgrade their operating system [18:53:55] If it was totally unused we wouldn't have noticed it was broken ;-) [18:54:36] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3748769 (10RobH) Swapped mainboard yesterday, but during the installer today got the following: [ 457.538179] BUG: soft lockup - CPU#19 stuck for 23s! [apt-get:38504] │ [ 493.53... 
[18:55:37] no_justification: oops yep i just saw that nm [18:59:50] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3748777 (10BBlack) So, looking at all the crash messages we've managed to record since the beginning of this ticket, the CPU# indicated has had a history of: 41, 23, 47, 47, 1, 19 . T... [18:59:54] (03PS1) 10ArielGlenn: correct a config file path for dumps cron listing last good dumps [puppet] - 10https://gerrit.wikimedia.org/r/390292 [19:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171109T1900). [19:00:04] ebernhardson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:44] i can deploy [19:01:04] (03CR) 10ArielGlenn: [C: 032] correct a config file path for dumps cron listing last good dumps [puppet] - 10https://gerrit.wikimedia.org/r/390292 (owner: 10ArielGlenn) [19:01:34] * ebernhardson broke and fixed enwiki yesterday, do i get a sticker? :P [19:02:03] ebernhardson: yes link me to your home wiki [19:02:04] (actually it was probably all the group2 wikis ...) [19:02:14] ebernhardson: You can get a t-shirt from bd808 if you ask nicely (and haven't got one already :P) [19:02:22] Zppix: i can never remember where my page is .. [19:02:39] ebernhardson: well then no sticker for you xD [19:03:08] ebernhardson: I have stickers and greg-g should too. The t-shirts are somewhere at the new office. :) [19:03:14] sweet! [19:03:28] Is there a picture of these stickers? 
[19:03:31] I wanna see them [19:06:14] !log ebernhardson@tin Synchronized php-1.31.0-wmf.7/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: Turn off DBN sizing AB test (duration: 00m 51s) [19:06:18] (03CR) 10Herron: "I don't know enough about mtail yet to +1 but the approach and example logs look reasonable to me" [puppet] - 10https://gerrit.wikimedia.org/r/388478 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [19:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:42] Zppix: slightly blurry one on my laptop in Vienna -- https://upload.wikimedia.org/wikipedia/commons/9/90/Wikimedia_Hackathon_2017_IMG_4315_%2833946951783%29.jpg -- It's the black globe sticker just to the right of the COMMTECH knuckle tattoo sticker [19:07:18] SO MUCH STICKERS! :D [19:07:21] bd808: 10/10 [19:07:36] bd808: and where are you going to put new stickers jesus [19:07:44] Sagan: there are more now. [19:07:48] o.O [19:07:53] you just add more in layers ;) [19:08:08] sigh, thought i would cheat to find my user page with a global search. Apparently i have pages on 39 wikis :P [19:08:17] ebernhardson: only 39? [19:08:19] bd808: I hope you didn't cover the CommTech one. :P [19:08:34] Niharika: never! [19:08:36] Number of attached accounts: 864 [19:08:48] bd808: so what's the reason to renew your equipment? too many stickers on the old laptop? 
:D [19:08:51] \o/ [19:08:58] ebernhardson: just give me the wiki you use the most xD [19:09:19] Sagan: the stickers caused overheat [19:09:22] xD [19:09:48] rule #1 for using stickers: don't cover the cooling [19:10:03] at least this is the ideal situation [19:10:28] Sagan: i cover all the cooling first then the motherboard then the outside lid [19:10:29] Zppix: it appears the one with information is funnily not my account with (WMF) in it: https://en.wikipedia.org/wiki/User:Ebernhardson [19:10:46] this pic that halfak posted on twitter is also of my laptop and shows a few of the newer stickers -- https://twitter.com/halfak/status/907637717705662465 [19:11:42] bd808: ah, I see the "I broke wikipedia.. and then I fixed it!" sticker [19:12:39] ebernhardson: i sent you a "sticker" [19:13:34] bd808: Remind me to get a photo of the framed t-shirt [19:14:07] :) then I could have a framed photo of the framed shirt I made [19:14:16] bd808: you're making me want to drive to wmf offices and beg for the stickers :/ that's a long drive from illinois [19:14:16] Reedy: I'm only at 745 attached accounts :( [19:14:36] I'm slacking! [19:14:44] no_justification: more than me [19:14:59] no_justification: I had a hacky script to create them on the closed wikis etc [19:15:28] I haven't run Timo's script in a while. Only 716 [19:15:54] Back in the day we opened new tabs on Special:Sitematrix [19:15:56] AND LIKED IT [19:17:02] I have 739 [19:18:54] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099#3748840 (10herron) 05Open>03stalled For the time being we're going to leave the hostcert setting alone and work around it during puppetmas... [19:18:56] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3748842 (10herron) [19:23:30] Hello. 
[19:23:52] bd808: funny you speak about stickers, I see this message on Mastodon: https://social.nasqueron.org/@deadsuperhero/98976097354452690 [19:26:10] I want stickers :( [19:29:46] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3748879 (10RobH) I've updated Dell, and they want me to move it to socket 1 and repeat. I'm asking them to just send me a replacement CPU, we'll see what happens. [19:41:59] 10Operations, 10Discovery-Search: search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#3718779 (10EBernhardson) This is apparently https://wikitech.wikimedia.org/wiki/Search.wikimedia.org and we still need to maintain it. I took a sample of 10k failures (out of ~25k in... [19:46:41] (03CR) 10Lokal Profil: [C: 04-1] "minor detail otherwise this looks good. note that we also need to update the real config which is deployed" (032 comments) [dumps/dcat] - 10https://gerrit.wikimedia.org/r/386366 (owner: 10JakobVoss) [19:58:32] (03CR) 10Lokal Profil: [C: 04-1] "The real config lives at modules/snapshot/files/dcatconfig.json in the operations/puppet repo (unless it has moved since the last time)" [dumps/dcat] - 10https://gerrit.wikimedia.org/r/386366 (owner: 10JakobVoss) [19:58:53] (03PS1) 10Chad: group2 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390302 [19:58:56] (03CR) 10Chad: [C: 032] group2 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390302 (owner: 10Chad) [20:00:04] no_justification: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171109T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. 
[20:01:06] (03Merged) 10jenkins-bot: group2 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390302 (owner: 10Chad) [20:01:17] (03CR) 10jenkins-bot: group2 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390302 (owner: 10Chad) [20:08:14] 10Operations, 10Discovery-Search: search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#3718779 (10demon) >>! In T179266#3748595, @debt wrote: > Let's take a look and fix this, then figure out who uses it and why. It's for Apple's dictionary bridge. As best I know, no *rec... [20:08:40] ebernhardson: Moar context ^ :) :( [20:09:54] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.7 [20:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:22] (03PS1) 10TerraCodes: Enable local uploads for tcywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) [20:17:45] (03CR) 10TerraCodes: [C: 031] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [20:19:10] (03CR) 10Lokal Profil: [C: 04-1] "looks like it moved to modules/snapshot/files/cron/dcatconfig.json" [dumps/dcat] - 10https://gerrit.wikimedia.org/r/386366 (owner: 10JakobVoss) [20:28:29] !log demon@tin Synchronized php-1.31.0-wmf.7/includes/libs/objectcache/WANObjectCache.php: less spammy error logs (duration: 00m 47s) [20:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:03] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3749116 (10RobH) They agreed and are dispatching a replacement part. I'll likely go ahead and do the proposed swap with the existing, but this will eliminate my having to make two tri... 
[20:52:52] 10Operations, 10Puppet: Update puppetmaster1001 puppet certificate - https://phabricator.wikimedia.org/T180167#3749164 (10herron) [20:53:36] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099#3749178 (10herron) T180167 created for the revoked puppetmaster1001 certificate [20:53:54] 10Operations, 10Puppet: Update puppetmaster1001 puppet certificate - https://phabricator.wikimedia.org/T180167#3749164 (10herron) [20:53:57] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3749182 (10herron) [20:56:13] (03PS1) 10Lokal Profil: [WIP]Support prefixed dump types [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390312 (https://phabricator.wikimedia.org/T163328) [21:16:10] 10Operations, 10Puppet: Update puppetmaster1001 puppet certificate - https://phabricator.wikimedia.org/T180167#3749217 (10herron) Puppetmaster1001 is not only a puppet master but the ca server so we need to be very cautious. Typically recreating an agent cert is along these lines: # on agent: `puppet agent... 
[21:17:50] 10Operations, 10Puppet: Update puppetmaster1001 puppet certificate - https://phabricator.wikimedia.org/T180167#3749221 (10herron) p:05Triage>03High [21:20:55] (03PS2) 10Herron: puppet: add puppet 4 auth.conf template [puppet] - 10https://gerrit.wikimedia.org/r/389720 (https://phabricator.wikimedia.org/T179722) [21:23:21] (03CR) 10Herron: [C: 032] puppet: add puppet 4 auth.conf template [puppet] - 10https://gerrit.wikimedia.org/r/389720 (https://phabricator.wikimedia.org/T179722) (owner: 10Herron) [21:23:26] (03PS3) 10Herron: puppet: add puppet 4 auth.conf template [puppet] - 10https://gerrit.wikimedia.org/r/389720 (https://phabricator.wikimedia.org/T179722) [21:31:28] 10Operations, 10Puppet: Update puppetmaster1001 puppet certificate - https://phabricator.wikimedia.org/T180167#3749248 (10herron) To reduce risk I think we should tackle this after depooling the eqiad puppet masters for upgrades [21:32:43] (03CR) 10Herron: "I think this is ready to merge but let's touch base on monday morning to confirm" [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) (owner: 10Herron) [21:33:10] (03Abandoned) 10Herron: puppetmaster: temporarily pin puppet* to jessie-backports in codfw [puppet] - 10https://gerrit.wikimedia.org/r/386217 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [22:08:12] (03PS1) 10Ayounsi: [WIP] Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 [22:15:13] (03Draft1) 10Paladox: puppetdb: Fix support for postgresql 9.6 [puppet] - 10https://gerrit.wikimedia.org/r/390332 [22:15:16] (03PS2) 10Paladox: puppetdb: Fix support for postgresql 9.6 [puppet] - 10https://gerrit.wikimedia.org/r/390332 [22:15:41] (03CR) 10jerkins-bot: [V: 04-1] puppetdb: Fix support for postgresql 9.6 [puppet] - 10https://gerrit.wikimedia.org/r/390332 (owner: 10Paladox) [22:16:54] (03PS3) 10Paladox: puppetdb: Fix support for postgresql 9.6 [puppet] - 10https://gerrit.wikimedia.org/r/390332 [22:28:50] !log 
Decommissioning Cassandra, restbase2004-c.codfw.wmnet (T179422) [22:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:56] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [22:30:48] (03CR) 10Ayounsi: [C: 031] puppetdb: Fix support for postgresql 9.6 [puppet] - 10https://gerrit.wikimedia.org/r/390332 (owner: 10Paladox) [22:31:44] (03CR) 10Paladox: "Tested locally and works (as far as it goes onto the next stage and tells me puppetdb is not available). But there's no puppetdb in stretc" [puppet] - 10https://gerrit.wikimedia.org/r/390332 (owner: 10Paladox) [22:39:59] (03CR) 10Zoranzoki21: [C: 031] Enable local uploads for tcywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) (owner: 10TerraCodes) [22:46:05] (03CR) 10Chad: [V: 032 C: 032] git.wikimedia.org -> phab [debs/logster] - 10https://gerrit.wikimedia.org/r/390175 (https://phabricator.wikimedia.org/T139089) (owner: 10TerraCodes) [22:46:11] (03CR) 10Chad: [V: 032 C: 032] git.wikimedia.org -> phab [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/390050 (https://phabricator.wikimedia.org/T139089) (owner: 10TerraCodes) [23:23:38] 10Operations, 10Traffic, 10netops, 10Cloud-VPS (Quota-requests): Request increased quota for traffic Cloud VPS project - https://phabricator.wikimedia.org/T180178#3749534 (10ayounsi) [23:24:13] PROBLEM - Disk space on install1002 is CRITICAL: DISK CRITICAL - free space: / 2754 MB (3% inode=98%) [23:35:41] 10Operations, 10Discovery-Search: search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#3749569 (10EBernhardson) Pulled some info on overall usage and http response codes from webrequest logs. This is for oct 9 through nov 9 for all requests with host search.wikimedia.org.... 
[23:40:54] 10Operations, 10Discovery-Search: search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#3749577 (10demon) Do we also have breakdowns by site param? Right now we allow wikipedia, wiktionary, wikinews and wikisource. Do all 4 projects get results? Are these all **exclusively*... [23:44:53] (03PS1) 10Chad: search.wikimedia.org: simplify limit handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390347 [23:45:28] (03PS2) 10Chad: search.wikimedia.org: simplify limit handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390347 (https://phabricator.wikimedia.org/T179266) [23:46:03] 10Operations, 10Cloud-VPS, 10Traffic, 10netops: Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3749582 (10ayounsi) [23:46:21] (03CR) 10jerkins-bot: [V: 04-1] search.wikimedia.org: simplify limit handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390347 (https://phabricator.wikimedia.org/T179266) (owner: 10Chad) [23:46:47] (03CR) 10Reedy: [C: 04-1] search.wikimedia.org: simplify limit handling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390347 (https://phabricator.wikimedia.org/T179266) (owner: 10Chad) [23:50:43] (03PS3) 10Chad: search.wikimedia.org: simplify limit handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390347 (https://phabricator.wikimedia.org/T179266) [23:57:21] 10Operations, 10Discovery-Search, 10Patch-For-Review: search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#3749634 (10EBernhardson) If needed i can pull a full month, but will take longer. This is for nov 8th (UTC). This is also limited to requests that returned a 200 re... 
[23:58:08] 10Operations, 10Cloud-VPS, 10Traffic, 10netops: Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3749582 (10madhuvishy) Noting here that proprietary software is not usually installed on WMCS environments per https://wikitech.wikimedia.org/wiki/Wikitech...