[00:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180124T0000).
[00:00:04] MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:03:14] so, any deployers around?
[00:04:08] (CR) Dzahn: "compiles now after https://gerrit.wikimedia.org/r/#/c/406001/" [puppet] - https://gerrit.wikimedia.org/r/405990 (owner: Dzahn)
[00:04:42] (CR) Dzahn: [C: 2] wmcs::puppetmaster: move standard/firewall include to roles [puppet] - https://gerrit.wikimedia.org/r/405990 (owner: Dzahn)
[00:04:57] (PS2) Dzahn: wmcs::puppetmaster: move standard/firewall include to roles [puppet] - https://gerrit.wikimedia.org/r/405990
[00:08:26] !log bootstrapping restbase2007-a - T184100
[00:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:41] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[00:11:21] :(
[00:11:37] James_F: find me a deployer?
[00:13:03] MatmaRex: Probably not. Bit chaötic here. :-(
[00:14:40] MatmaRex: matt_flaschen says he'll do it.
[00:14:44] (He's awesome.)
[00:16:03] whoo
[00:16:36] Okay, reviewing patches now.
[00:17:27] Just one. Everyone must be busy somewhere (yeah, I'm also at Dev Summit).
[00:20:20] (PS1) Dzahn: openstack::main: move standard/firewall includes to roles [puppet] - https://gerrit.wikimedia.org/r/406003
[00:21:13] MatmaRex: are you only not a deployer yourself by choice?
[00:21:47] * bd808 feels like he has asked this before
[00:21:56] bd808: mostly yes
[00:27:19] (PS2) Dzahn: openstack::main: move standard/firewall includes to roles [puppet] - https://gerrit.wikimedia.org/r/406003
[00:28:22] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/9811/" [puppet] - https://gerrit.wikimedia.org/r/406003 (owner: Dzahn)
[00:28:58] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational
[00:29:57] (CR) Mobrovac: [C: -1] "One minor thing, otherwise LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: Eevans)
[00:31:18] * matt_flaschen sighs
[00:31:25] "800 | WARNING | Line exceeds 100 characters; contains 102 characters"
[00:31:28] https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/33584/console
[00:31:50] ^ James_F
[00:32:07] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:32:31] I guess the rule is different on master, or recently changed.
[00:32:48] That's… pretty unhelpful.
[00:32:53] matt_flaschen: uhh, no
[00:32:54] MatmaRex: ^^^
[00:33:00] <_joe_> anyone doing things on rb1011?
[00:33:03] matt_flaschen: that looks like the error output for a different patch
[00:33:39] we're not even touching that file
[00:34:56] Yeah, that's the failure for 405993.
[00:35:07] matt_flaschen: https://gerrit.wikimedia.org/r/#/c/405933/ just merged
[00:35:07] The SWAT is for 405933.
[00:35:09] MatmaRex, oops, sorry.
[00:35:15] Double-3 not double-9.
:-) [00:39:58] !log mobrovac@tin Started deploy [zotero/translators@8f53531]: Update translators to 528296d [00:40:06] !log mobrovac@tin Finished deploy [zotero/translators@8f53531]: Update translators to 528296d (duration: 00m 08s) [00:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:00] James_F, please test: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Staging_changes [00:44:04] On mwdebug1002 [00:44:21] MatmaRex: ^^ [00:44:55] ooking [00:44:58] looking* [00:51:42] please hold on. firefox's debugger is making me cry [00:54:21] matt_flaschen: are you sure the code is live? [00:55:21] MatmaRex, only on mwdebug1002. ^ [00:55:44] matt_flaschen: i'm looking at https://en.wikipedia.org/w/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js on mwdebug1002 and it's showing the old code [00:56:23] it doesn't have the "Support: Firefox =< 52" line [00:56:29] MatmaRex, ahg, I forgot to update the submodule. Really sorry. [00:56:31] matt_flaschen: are you seeing the new version there? [00:56:51] aha. no problem :) [00:57:09] MatmaRex: matt_flaschen was just making sure you were actually testing. ;-) [00:58:57] MatmaRex, it's really on mwdebug1002 now. Sorry again. [00:59:24] matt_flaschen: yeah. works now :) [01:02:41] !log mattflaschen@tin Synchronized php-1.31.0-wmf.17/extensions/VisualEditor/: (no justification provided) (duration: 00m 58s) [01:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:54] thanks matt_flaschen [01:08:53] MatmaRex, no problem. Please test without mwdebug1002. [01:09:38] i did and it's working [01:09:47] MatmaRex, thanks. 
[01:10:00] !log Deployed 'T185304: NWE: Don't attempt to set selection on unattached textarea' in extensions/VisualEditor
[01:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:12] T185304: NWE doesn't load in Firefox ESR - https://phabricator.wikimedia.org/T185304
[01:14:53] !log SWAT complete
[01:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:31] matt_flaschen: You're awesome, thank you.
[01:16:59] James_F, you're welcome.
[01:25:18] (PS1) Ayounsi: Assigning eqsin PCCW peering IPs [dns] - https://gerrit.wikimedia.org/r/406011
[01:27:29] (CR) Ayounsi: [C: 2] Assigning eqsin PCCW peering IPs [dns] - https://gerrit.wikimedia.org/r/406011 (owner: Ayounsi)
[02:22:45] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.17) (duration: 05m 33s)
[02:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:37] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 769.42 seconds
[03:56:47] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 207.84 seconds
[04:11:49] (Draft2) Jayprakash12345: Allow bureaucrats to add/remove 'accountcreator' permission on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/406012 (https://phabricator.wikimedia.org/T185597)
[04:12:56] (CR) Jayprakash12345: "Please review deeply." [mediawiki-config] - https://gerrit.wikimedia.org/r/406012 (https://phabricator.wikimedia.org/T185597) (owner: Jayprakash12345)
[04:57:27] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=61%)
[06:18:57] RECOVERY - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.176 port 9042
[06:24:54] (CR) Brian Wolff: "Wouldnt url-downloader.wikimedia.org block tool labs or is my knowledge on how this all works outdated?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/404653 (https://phabricator.wikimedia.org/T185087) (owner: Aklapper)
[06:26:20] !log bootstrapping restbase2007-b - T184100
[06:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:34] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[06:27:27] RECOVERY - cassandra-b SSL 10.192.16.177:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-b valid until 2018-08-17 16:12:09 +0000 (expires in 205 days)
[06:36:08] PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:37:07] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%)
[06:46:17] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[06:47:17] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[07:11:37] PROBLEM - HHVM rendering on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time
[07:11:57] PROBLEM - Nginx local proxy to apache on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time
[07:12:08] PROBLEM - Apache HTTP on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[07:12:37] RECOVERY - HHVM rendering on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 74539 bytes in 0.188
second response time [07:12:57] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.044 second response time [07:13:08] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.215 second response time [07:19:37] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [07:21:27] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [07:27:38] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 26 probes of 290 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [07:32:37] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 13 probes of 290 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [07:49:58] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [07:50:57] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [08:03:02] cp4024 seems throwing 503s.. 
it happens daily now in ulsfo
[08:14:37] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[08:14:37] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[08:16:32] !log cp4024: restart varnish-be due to 503s
[08:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:17] RECOVERY - Check Varnish expiry mailbox lag on cp4024 is OK: OK: expiry mailbox lag is 0
[08:28:38] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[08:29:38] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[08:34:14] (CR) TerraCodes: [C: 1] Allow bureaucrats to add/remove 'accountcreator' permission on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/406012 (https://phabricator.wikimedia.org/T185597) (owner: Jayprakash12345)
[08:37:43] (PS49) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[10:17:33] (PS1) Jon Harald Søby: Set category collation for nowikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/406022
(https://phabricator.wikimedia.org/T185630)
[10:19:23] (CR) Jon Harald Søby: "When this is merged, the deployer needs to run the following command:" [mediawiki-config] - https://gerrit.wikimedia.org/r/406022 (https://phabricator.wikimedia.org/T185630) (owner: Jon Harald Søby)
[11:27:14] marostegui: jynus -- I'm about to perform a bigdelete on dewiki, page has 5 088 revids; be adviced
[11:28:12] I'll use API, it's faster
[11:49:54] (PS1) MarcoAurelio: Bureaucrats on WMF wikis to add and remove 'accountcreator' by default [mediawiki-config] - https://gerrit.wikimedia.org/r/406025 (https://phabricator.wikimedia.org/T185417)
[12:33:27] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[12:57:49] (PS4) Paladox: ircecho: Support ssl when connecting to irc [puppet] - https://gerrit.wikimedia.org/r/405591
[13:07:45] (PS5) Paladox: ircecho: Support ssl when connecting to irc [puppet] - https://gerrit.wikimedia.org/r/405591
[13:07:59] (CR) Paladox: ircecho: Support ssl when connecting to irc (5 comments) [puppet] - https://gerrit.wikimedia.org/r/405591 (owner: Paladox)
[13:08:16] (CR) jerkins-bot: [V: -1] ircecho: Support ssl when connecting to irc [puppet] - https://gerrit.wikimedia.org/r/405591 (owner: Paladox)
[13:09:42] (CR) Paladox: ircecho: Support ssl when connecting to irc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/405591 (owner: Paladox)
[13:12:32] (PS6) Paladox: ircecho: Support ssl when connecting to irc [puppet] - https://gerrit.wikimedia.org/r/405591
[13:12:59] (CR) jerkins-bot: [V: -1] ircecho: Support ssl when connecting to irc [puppet] - https://gerrit.wikimedia.org/r/405591 (owner: Paladox)
[13:17:17] (PS7) Paladox: ircecho: Support ssl when connecting to irc [puppet] - https://gerrit.wikimedia.org/r/405591
[13:42:04] Operations, Performance-Team, Traffic: load.php response taking 160s (of which only 0.031s in Apache) -
https://phabricator.wikimedia.org/T181315#3921750 (Gilles) Revisiting[[ https://logstash.wikimedia.org/goto/788ca720a38ccbed8dab29adab7ac2ca | the link I posted previously ]], graph for the past 2...
[13:42:13] Operations, Performance-Team, Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3921751 (Gilles) a:Gilles>None
[13:42:57] RECOVERY - cassandra-b CQL 10.192.16.177:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.177 port 9042
[14:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180124T1400).
[14:00:04] Jhs and Jayprakash12345: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:03:26] zeljkof:Hello
[14:09:30] Who will swat?
[14:15:59] ahoy, i'm here for swat if i'm not too late
[14:16:07] my IRC client was acting up and i didn't notice until now
[14:17:05] me too
[14:17:42] let's ping zeljkof for good measure ;)
[14:19:13] zeljkof: Hello
[14:19:42] hmm, i think he might be in San Francisco for the all-hands thing, so probably not awake
[14:20:46] aude, perhaps?
[14:25:32] hashar: are you here?
[14:31:26] MaxSem: ping?
[14:32:09] aude: ping?
[14:32:50] no_justification: ping?
[14:39:29] 6:40 am, it's going to be a hard sell to find sf folks on line now
[15:29:42] <_joe_> apergos: ping
[15:29:45] <_joe_> :P
[15:30:25] _joe_: ponnngggg
[15:32:28] Operations, media-storage: upload.wikimedia.org reports wrong mimetype for svg - https://phabricator.wikimedia.org/T179787#3921821 (zhuyifei1999)
[15:34:43] I need to get off the 6am swat I never ever do it (if I'm even awake)
[15:35:25] Good thing I sleep with my phone on mute
[15:50:57] (CR) Aklapper: "@Brian: I don't get the question and how it is related, sorry. Could you elaborate?" [mediawiki-config] - https://gerrit.wikimedia.org/r/404653 (https://phabricator.wikimedia.org/T185087) (owner: Aklapper)
[16:10:48] no_justification: to simplify tracking we just have it ping every swatter each slot :)
[16:11:23] Maybe I should just resign as a swatter 😉
[16:11:46] * greg-g glares
[16:11:48] :)
[16:14:58] greg-g: well Im usually last to volunteer anyway 😂
[16:48:58] Operations, ops-eqiad, Analytics-Kanban: BBU alarms flapping for analytics1038 - https://phabricator.wikimedia.org/T185409#3921916 (RobH) This is an older R720xd, and uses an older H710 controller. While @Cmjohnson can check for a spare when back onsite, there is a good chance we don't have any. If...
[16:59:51] Operations, Phabricator: Switch phabricator from using apache to nginx - https://phabricator.wikimedia.org/T185644#3921927 (Paladox)
[17:01:30] Operations, Gerrit: Switch gerrit from using apache to nginx - https://phabricator.wikimedia.org/T185645#3921946 (Paladox)
[17:08:26] Operations, Phabricator: Switch phabricator from using apache to nginx - https://phabricator.wikimedia.org/T185644#3921962 (Paladox)
[17:08:35] Operations, Gerrit: Switch gerrit from using apache to nginx - https://phabricator.wikimedia.org/T185645#3921963 (Paladox)
[17:15:34] ACKNOWLEDGEMENT - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia
[17:15:34] title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL
[17:15:34] itle from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retri eevans Not production
[17:15:34] ACKNOWLEDGEMENT - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type}
(Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia [17:15:34] title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL [17:15:34] itle from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retri eevans Not production [17:15:34] ACKNOWLEDGEMENT - MD RAID on restbase-dev1006 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 eevans Not production [17:15:35] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.168 and port 9042: Connection refused eevans Not production [17:15:35] ACKNOWLEDGEMENT - puppet last run on restbase-dev1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues eevans Not production [17:15:36] ACKNOWLEDGEMENT - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia [17:15:36] title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL [17:15:37] itle from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retri eevans Not production [17:16:00] ACKNOWLEDGEMENT - IPsec on mc1036 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2036_v4 Giuseppe Lavagetto mc2036 is down for hardware repair, the IPSEC is going to be broken until that comes up again. [17:19:32] ACKNOWLEDGEMENT - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
eevans Investigating error (but cassandra-metrics-collector is not in use, and its failure isnt degrading)
[17:22:46] (Abandoned) MarcoAurelio: Add gwtoolset to GlobalGroupPermissions [mediawiki-config] - https://gerrit.wikimedia.org/r/395806 (owner: MarcoAurelio)
[17:43:40] (PS13) Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151)
[17:51:18] (PS5) MarcoAurelio: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/405421 (https://phabricator.wikimedia.org/T143789)
[17:54:00] gerrit seems to be broken
[17:54:06] can't do "git review"
[17:54:19] same for other people in the office
[17:54:49] Amir1 what's the error please?
[17:55:57] Amir1: I was wondering if it was just me and the weird wifi I'm on.
[17:55:59] Amir1 are you using git-review version 1.26.0?
[17:56:25] You need to be on 1.26.0 to prevent a bug from comming up where it does /changes/ instead of /r/changes/
[17:57:05] It took a *very* long time just to do a clone of a very small new repo
[17:57:25] That would be your internet connection :)
[17:57:49] paladox: I'm not sure. direct ssh and traceroute are looking good
[17:57:58] just git operations are slow
[17:58:05] bd808 ah
[17:58:10] operations is a very large repo
[17:58:21] it is the biggest out of all repos and tieing close to mw.
[17:59:02] bd808 maybe we should gc it as it is slow for me too.
[17:59:04] no_justification ^^
[17:59:26] huh, git pull still pending...
[17:59:38] (usually it answers within 30 seconds, now its taking longer)
[17:59:40] paladox: not the operations repo, the act of using git via ssh to clone and update. The exact repo I was hitting is https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/libs/ObjectFactory
[17:59:41] Hmm it's not pulling for me.
[17:59:58] of course everyone went to git pull when amir reported the error so now its really borked ;D
[18:00:02] bd808 ah, well the operations repo isen't cloning for me. (updating i mean)
[18:00:04] no_justification: ^
[18:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180124T1800).
[18:00:04] No GERRIT patches in the queue for this window AFAICS.
[18:00:18] * paladox creates a task
[18:00:22] Don't bother
[18:00:23] I'll repack it
[18:00:26] Probably won't help
[18:00:30] It's a /big/ repo
[18:00:36] git pull is still just sitting for me heh
[18:00:42] I am here
[18:01:49] no_justification it seems git cloning a small repo is not working.
[18:02:03] yeah git pull on operations/dns is also not working
[18:02:05] and its a very tiny repo
[18:02:08] Well I'm already here, so still don't file a task yet
[18:02:08] Operations, Gerrit: git cloning is not working with gerrit repos - https://phabricator.wikimedia.org/T185649#3922041 (Paladox)
[18:02:23] also just not git pull, not clone...
[18:02:24] no_justification oh i just saw that i will close it then.
[18:02:35] so should I not be attempting to do a regular pull right now?
[18:02:43] apergos: i tried and its not working
[18:02:47] Operations, Gerrit: git cloning is not working with gerrit repos - https://phabricator.wikimedia.org/T185649#3922051 (Paladox) Open>Invalid Chad's dealing with it on irc.
[18:02:47] right
[18:02:57] Who can SWAT today?
[18:03:34] gits not working so swat is likely going to get held up
[18:03:51] rephrase: our git/gerrit instance is having issues.
[18:03:58] Cloned dns instantly
[18:04:04] huh
[18:04:24] Y'all using SSH?
[18:04:28] * no_justification doesn't
[18:04:35] yes
[18:04:43] no_justification ssh
[18:04:46] SSH pool might be exausted.
[18:04:58] Would explain why my maintenance commands won't complete
[18:05:02] how do we go about fixing?
[18:05:04] And why puppet's not freaking out everywhere
[18:05:24] Well everyone should ctrl+c out of their clones they're doing for "testing"
[18:05:29] For starters ;-)
[18:06:59] Aaron has had a push open since Jan 18th.
[18:07:00] Wtf.
[18:07:04] Something got stuck
[18:08:18] Amir1: You've got 3 fetches open to Wikibase...?
[18:08:47] Er, or pushes. I always get upload-pack and receive-pack backards
[18:12:21] Let me know if SWAT member ready
[18:14:17] yea, also using ssh. i'm not going to try though since you said stop testing :)
[18:14:25] Operations, Gerrit: git cloning is not working with gerrit repos - https://phabricator.wikimedia.org/T185649#3922041 (Zoranzoki21) All is ok. For me in Serbia work without problems.
[18:14:29] didnt notice an issue earlier
[18:14:45] People should use https instead ;-)
[18:15:48] i think i even changed that _from_ https to ssh because of some reason
[18:15:57] ok
[18:16:00] are the current issues fixed, should we kick it once ?
[18:16:17] because of git-review
[18:16:24] yea, that
[18:16:26] now that the issue is fixed in 1.26.0
[18:16:37] i can change it bac?, yes, i am doing that
[18:16:47] Ok.
[18:17:01] Operations, Gerrit: git cloning is not working with gerrit repos - https://phabricator.wikimedia.org/T185649#3922073 (Paladox) Using ssh?
[18:17:30] /r/ or /r/p/ heh
[18:17:44] no_justification i wonder do we want to in crease sshd.threads ?
[18:17:57] Yes, but not blindly.
[18:18:02] I want to revisit all of the thread pool settings
[18:18:07] And base them in Science!
[18:18:39] Operations, Gerrit: git cloning is not working with gerrit repos - https://phabricator.wikimedia.org/T185649#3922077 (Zoranzoki21) >>! In T185649#3922073, @Paladox wrote: > Using ssh?
With all avaiable options for cloning (anon https, noanon https and ssh)
[18:19:09] Operations, Gerrit: git cloning is not working with gerrit repos - https://phabricator.wikimedia.org/T185649#3922078 (Paladox) Hmm strange as it's affecting us and uk users over ssh.
[18:19:29] Ok :)
[18:19:47] Operations, Gerrit: git cloning is not working with gerrit repos - https://phabricator.wikimedia.org/T185649#3922080 (demon)
[18:20:47] Operations, Gerrit: git cloning is not working with gerrit repos - https://phabricator.wikimedia.org/T185649#3922085 (Zoranzoki21) >>! In T185649#3922078, @Paladox wrote: > Hmm strange as it's affecting us and uk users over ssh. Note: I using internet from provider telenor.
[18:21:20] Git clone was not woking in india sometime before. But Now It is working
[18:23:31] so to those who use ssh. in your .git/config in a repo, a line with "url = ssh://" what you do is change the protocol to https://, remove the port number at the end and change the URL to add "/r/p/" before the repo name
[18:23:35] example: url = https://gerrit.wikimedia.org/r/p/operations/dns.git
[18:24:05] Operations, Gerrit: git cloning is not working with gerrit repos - https://phabricator.wikimedia.org/T185649#3922041 (Lucas_Werkmeister_WMDE) We were experiencing the same problem in the Wikimedia Germany office, but it seems to be working again now (since 18:17 UTC).
[18:24:06] and then it works fine
[18:24:45] just dont forget to also remove that port :29418
[18:25:18] Operations, Gerrit: git cloning is not working with gerrit repos - https://phabricator.wikimedia.org/T185649#3922089 (Zoranzoki21) I have not had a problem for a long time with this.
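A minimal sketch of the URL rewrite described at 18:23:31 and 18:24:45, for anyone who wants to script it rather than hand-edit .git/config. The function name `ssh_to_https` is hypothetical, and two details are assumptions beyond what was said in the channel: the ssh username is dropped (anonymous https), and a `.git` suffix is appended when missing so the result matches the example at 18:23:35.

```python
from urllib.parse import urlsplit


def ssh_to_https(url: str) -> str:
    """Rewrite ssh://[user@]host[:port]/repo as https://host/r/p/repo.git.

    Hypothetical helper: the /r/p/ prefix and the port removal follow the
    chat advice; dropping the username and appending ".git" are assumptions.
    """
    parts = urlsplit(url)
    if parts.scheme != "ssh":
        # Already https (or some other scheme): leave it untouched.
        return url
    host = parts.hostname  # hostname excludes both "user@" and ":29418"
    repo = parts.path.lstrip("/")
    if not repo.endswith(".git"):
        repo += ".git"
    return f"https://{host}/r/p/{repo}"


if __name__ == "__main__":
    # "someuser" is a placeholder account name, not taken from the log.
    print(ssh_to_https("ssh://someuser@gerrit.wikimedia.org:29418/operations/dns"))
```

The printed value can then be applied with `git remote set-url origin <new-url>`, which has the same effect as editing the `url =` line by hand.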
[18:28:33] (PS1) Chad: Gerrit: Remove arbitrary SSH thread settings [puppet] - https://gerrit.wikimedia.org/r/406052 (https://phabricator.wikimedia.org/T182756)
[18:28:52] robh, _joe_ ^^
[18:29:01] yep, reviewing now
[18:29:20] so removing it will default it back to the math you place in the commit msg?
[18:29:51] im not sure of that behavior but it seems sensible to me... also if it doesnt work we can just revert and unbreak gerrit
[18:29:58] and this only affects gerrit so seems ok to me to merge
[18:30:22] no_justification: shall i just go ahead and merge, then we can kick puppet on cobalt and test?
[18:31:08] (CR) RobH: [C: 2] Gerrit: Remove arbitrary SSH thread settings [puppet] - https://gerrit.wikimedia.org/r/406052 (https://phabricator.wikimedia.org/T182756) (owner: Chad)
[18:31:24] i meant to +1 that not +2...
[18:31:36] no_justification: i assume this should wiat for _joe_ as well?
[18:31:38] Yeah, needs a merge, puppet run on cobalt & gerrit2001
[18:31:42] Then I'll restart gerrit
[18:31:45] (CR) RobH: [C: 1] Gerrit: Remove arbitrary SSH thread settings [puppet] - https://gerrit.wikimedia.org/r/406052 (https://phabricator.wikimedia.org/T182756) (owner: Chad)
[18:31:54] ok, ill go ahead and merge then
[18:31:56] robh: If you're ok with it.
Just pinged _joe_ too cuz he asked in another channel :)
[18:32:07] im cool with it since itll only break gerrit anyhow
[18:32:09] and we're actively working on it
[18:32:12] (CR) RobH: [C: 2] Gerrit: Remove arbitrary SSH thread settings [puppet] - https://gerrit.wikimedia.org/r/406052 (https://phabricator.wikimedia.org/T182756) (owner: Chad)
[18:32:30] do it :)
[18:32:47] !log bootstrapping restbase2007-c - T184100
[18:32:48] iirc, those settings came because the original defaults way back when were batshit
[18:32:52] (CR) Faidon Liambotis: [C: 1] "WFM, although I would have picked (52 + 52/4) * 7 * 24 = 10920h instead :)" [puppet] - https://gerrit.wikimedia.org/r/404434 (https://phabricator.wikimedia.org/T160677) (owner: Filippo Giunchedi)
[18:32:55] defaualts sounds sane
[18:32:57] for now
[18:32:59] But I've honestly not touched them in agesssssss
[18:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:01] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[18:33:07] Gerrit mostly chooses sane defaults for thread pools and such
[18:33:07] but better than ancient settings
[18:33:13] (httpd being the one exception)
[18:33:17] *nod*
[18:33:47] ok, merged and puppet is running on cobalt and gerrit2001
[18:33:48] RECOVERY - cassandra-c service on restbase2007 is OK: OK - cassandra-c is active
[18:33:48] RECOVERY - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-c valid until 2018-08-17 16:12:10 +0000 (expires in 204 days)
[18:33:57] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
[18:33:58] no_justification: cobalt done and saw the update
[18:34:08] RECOVERY - cassandra-b service on restbase2007 is OK: OK - cassandra-b is active
[18:34:11] same with gerrit2001
[18:34:17] Ok, I'll restart services on both (2001 then cobalt)
[18:34:17] RECOVERY - puppet last run on restbase2007
is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:34:56] Ok, gerrit2001 succeeded (in failing, known issue) [18:34:57] :P [18:35:03] ;) [18:35:16] !log gerrit: restarting services, will be back momentarily [18:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:37] Timing :) [18:35:59] Back! [18:36:50] Jayprakash12345_: im kind of surprised that swat was scheduled this week at all [18:37:03] this week is the dev summit and all hands [18:38:18] so are git pulls back in business? [18:38:50] They were before the fix. I cleared the pool and had robh test while it was empty [18:39:00] This will keep people from clogging it up again [18:39:04] (it's usually an accident) [18:41:08] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:42:11] no_justification: thanks, I was suggesting doing that, at least until a more permanent decision is taken [18:44:56] jynus: More permanent...? [18:45:52] (ticket) [18:46:07] Ahhhh, ok. No worries [18:46:14] Turns out it was a dupe of a private task we already had [18:46:25] cool [18:46:30] tldr: gerrit's thread pool too tiny. So we upped it [18:46:31] :) [18:46:44] I added a question for more long-term [18:50:09] jynus: Responding. I have a long answer :) [18:56:27] I answered too regarding the one we block [18:57:06] proxy purchases for codfw is going slow because there are many other blockers and we are not 100% sure of the final infrastructure [18:57:20] *infrastructure's architecture [18:59:16] Good to know! [19:01:29] nice! 
[19:03:41] but we should unblock it yes or yes by EoQ [19:04:01] it may need app changes, though, like hiera-fication of port numbers [19:04:28] (not only for gerrit, for all misc services using databases) [19:06:49] 10Operations, 10Gerrit: Switch gerrit from using apache to nginx - https://phabricator.wikimedia.org/T185645#3922220 (10demon) 05Open>03declined Apache uses less than 100MB of memory on average, plus Gerrit's heap has plenty of space right now--we don't need the memory headroom. Also, I'd rather move it b... [19:09:17] jynus: Like I said on the task: it'd be *awesome* to offload Puppet's git::clone{} operations to a slave. It's usually only a few seconds behind the master, and those operations are all read-only anyway :) [19:10:36] <_joe_> no_justification: uhm, that would merit its own task [19:11:07] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:11:12] <_joe_> also, I might take the time to rewrite that bowl of puppet tech debt as a proper resource [19:11:25] <_joe_> no_justification: but git::clone goes via https IIRC [19:11:29] jynus no_justification also account data is partially stored in a git repo in the next update. So that will be important to get that synced too. [19:11:55] All repos are sync'd [19:12:25] _joe_: Yes, it does :) [19:12:49] <_joe_> so the threadpool issue is not just about ssh [19:13:24] but? 
[19:14:04] The thread pool for HTTP is wayyyyyy larger [19:14:29] maxThreads = 60 [19:14:34] (minThreads = 10) [19:15:37] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [19:16:37] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [19:18:58] (03PS1) 10Jcrespo: mariadb: Depool es2011 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406060 [19:23:35] (03PS1) 10Dzahn: rename piwik::server to just piwik [puppet] - 10https://gerrit.wikimedia.org/r/406061 [19:24:38] really using the "http password" setting to upload via https then [19:24:41] (03PS1) 10Jcrespo: mariadb: Reimage es2011 into stretch/MariaDB 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/406062 [19:25:25] I will be doing some minimal mediawiki deployments today [19:25:28] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [19:28:27] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es2011 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406060 (owner: 10Jcrespo) [19:29:59] (03Merged) 10jenkins-bot: mariadb: Depool es2011 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406060 (owner: 10Jcrespo) [19:30:16] (03CR) 10jenkins-bot: mariadb: Depool es2011 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406060 (owner: 10Jcrespo) [19:38:58] PROBLEM - Check Varnish expiry mailbox lag on cp4025 is CRITICAL: CRITICAL: expiry mailbox lag is 2016246 [19:41:52] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool es2011 (duration: 00m 57s) [19:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:38] h/go rele [19:57:10] !log starting es2011 reimage [19:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:46] 10Operations, 10ops-esams, 10Traffic, 10hardware-requests: Procure and install 
LVS and miscellaneous servers - https://phabricator.wikimedia.org/T184068#3922310 (10RobH) a:03BBlack @BBlack, Can I get some feedback on how many lvs systems we'll need in esams going forward for this replacement? I've assu... [20:46:28] (03CR) 10Imarlier: [C: 031] prometheus: bump global retention to 15 months [puppet] - 10https://gerrit.wikimedia.org/r/404434 (https://phabricator.wikimedia.org/T160677) (owner: 10Filippo Giunchedi) [20:49:39] 10Operations, 10ops-esams, 10Traffic, 10hardware-requests: Procure and install LVS and miscellaneous servers - https://phabricator.wikimedia.org/T184068#3922373 (10BBlack) a:05BBlack>03RobH Basically, for the cache-only sites the setup we've recently purchased for ulsfo + eqsin applies for esams refres... [21:02:28] (03CR) 10Jcrespo: [C: 032] mariadb: Reimage es2011 into stretch/MariaDB 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/406062 (owner: 10Jcrespo) [21:11:43] 10Operations, 10Analytics-Data-Quality, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3922415 (10BBlack) Regarding the accuracy/interpretation of `response_size`: it is based on varnishncsa's `%b`, which is the amount of HTTP bod... [21:24:43] 10Operations, 10TechCom, 10Services (attic), 10User-mobrovac: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#3922428 (10Krinkle) 05stalled>03Open [21:26:28] PROBLEM - HHVM rendering on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:27:27] RECOVERY - HHVM rendering on mw2124 is OK: HTTP OK: HTTP/1.1 200 OK - 75335 bytes in 0.315 second response time [21:29:35] 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#2766226 (10jcrespo) I think I am suffering this, but not on first boot, but on installer (both jessie and stretch). This doesn't happen with regular dbs, but these have 20TB, which may affect it being extr... 
[21:52:11] (03PS1) 10Jcrespo: mariadb-partman: Modify recipe, to test with es2011 [puppet] - 10https://gerrit.wikimedia.org/r/406114 [21:53:28] (03CR) 10Jcrespo: [C: 032] mariadb-partman: Modify recipe, to test with es2011 [puppet] - 10https://gerrit.wikimedia.org/r/406114 (owner: 10Jcrespo) [21:56:38] If anyone is around that can reset my 2fa on phab that would be great! :) (not urgent) I'll just keep posting here every now and again fishing for someone! [21:57:08] PROBLEM - HHVM rendering on mw1281 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [21:57:17] PROBLEM - Apache HTTP on mw1281 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [21:58:08] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 75319 bytes in 0.127 second response time [21:58:17] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.217 second response time [22:08:27] (03PS1) 10Jcrespo: Revert "mariadb: Reimage es2011 into stretch/MariaDB 10.1" [puppet] - 10https://gerrit.wikimedia.org/r/406117 [22:08:33] (03PS2) 10Jcrespo: Revert "mariadb: Reimage es2011 into stretch/MariaDB 10.1" [puppet] - 10https://gerrit.wikimedia.org/r/406117 [22:15:02] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Reimage es2011 into stretch/MariaDB 10.1" [puppet] - 10https://gerrit.wikimedia.org/r/406117 (owner: 10Jcrespo) [22:17:06] addshore: how can I know it's you? (I don't think i have access anyway...) [22:17:31] hehe, im in the office next to thcipriani and Reedy if that helps ;) [22:17:43] greg-g: addshore is the worst [22:18:02] nice! hope to see you tonight! [22:18:40] https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Access_list chad and mukunda, or a root root :) [22:19:20] addshore: still around? and if so, are you in the office? [22:19:50] apergos: indeed and indeed!
[22:20:00] I am in the kitchen / by the kitchen / whatever this place is called [22:20:10] duebner lounge [22:20:55] I found the docs :D https://wikitech.wikimedia.org/wiki/Phabricator#Removing_Two_Factor_Authentication [22:21:12] (03PS1) 10Jcrespo: partman: Add temporary hack to test es2011 partitioning [puppet] - 10https://gerrit.wikimedia.org/r/406121 [22:22:04] 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#3922480 (10jcrespo) My issue could be an installer one, so ignore my latest comment. [22:23:04] (03CR) 10Jcrespo: [C: 032] partman: Add temporary hack to test es2011 partitioning [puppet] - 10https://gerrit.wikimedia.org/r/406121 (owner: 10Jcrespo) [22:30:43] 10Operations, 10Analytics: setup/install evenlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3922494 (10RobH) p:05Triage>03Normal [22:35:48] 10Operations, 10ops-eqiad: apply hostname labels to eventlog1001/WMF4751 - https://phabricator.wikimedia.org/T185668#3922507 (10RobH) p:05Triage>03Normal [22:43:35] (03PS1) 10RobH: setting dns entries for eventlog1002 [dns] - 10https://gerrit.wikimedia.org/r/406127 (https://phabricator.wikimedia.org/T185667) [22:46:09] (03PS2) 10RobH: setting dns entries for eventlog1002 [dns] - 10https://gerrit.wikimedia.org/r/406127 (https://phabricator.wikimedia.org/T185667) [22:47:12] (03CR) 10RobH: [C: 032] setting dns entries for eventlog1002 [dns] - 10https://gerrit.wikimedia.org/r/406127 (https://phabricator.wikimedia.org/T185667) (owner: 10RobH) [22:47:48] 10Operations, 10Analytics, 10Patch-For-Review: setup/install evenlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3922545 (10RobH) [22:48:25] (03PS1) 10Addshore: Add 'RevisionStore' to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406128 [22:48:49] 10Operations, 10Analytics, 10hardware-requests: EQIAD: (1) hardware request for eventlog1001 replacement - eventlog1002. 
- https://phabricator.wikimedia.org/T184551#3922546 (10RobH) 05Open>03Resolved T185667 has been created to track the setup of eventlog1002. Resolving this #hw-request. [22:56:46] 10Operations, 10Analytics, 10Patch-For-Review: setup/install evenlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3922558 (10RobH) [23:00:59] 10Operations, 10Analytics, 10Patch-For-Review: setup/install evenlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3922561 (10RobH) a:05RobH>03Ottomata @Ottomata: eventlog1001 is trusty. Can eventlog1002 be stretch or does it need to be an older distro? Please advise and assign back to me... [23:08:36] (03PS2) 10RobH: adding new shell user Ramsey Isler [puppet] - 10https://gerrit.wikimedia.org/r/405981 (https://phabricator.wikimedia.org/T185356) [23:08:38] (03PS2) 10RobH: adding Ramsey Isler to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/405982 (https://phabricator.wikimedia.org/T185356) [23:08:40] (03PS1) 10RobH: eventlog1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/406129 (https://phabricator.wikimedia.org/T185667) [23:08:46] .... wtffff [23:09:00] (03PS2) 10RobH: eventlog1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/406129 (https://phabricator.wikimedia.org/T185667) [23:09:05] bah, bad local state push [23:09:11] easy enough to fix, just annoying. 
[23:10:58] (03CR) 10RobH: [C: 032] eventlog1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/406129 (https://phabricator.wikimedia.org/T185667) (owner: 10RobH) [23:11:57] 10Operations, 10Analytics: setup/install evenlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3922578 (10RobH) [23:15:47] !log cp4025: restart varnish backend due to mbox lag [23:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:07] RECOVERY - Check Varnish expiry mailbox lag on cp4025 is OK: OK: expiry mailbox lag is 0 [23:22:11] (03PS3) 10RobH: adding new shell user Ramsey Isler [puppet] - 10https://gerrit.wikimedia.org/r/405981 (https://phabricator.wikimedia.org/T185356) [23:24:27] PROBLEM - HHVM rendering on mw2110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:25:17] RECOVERY - HHVM rendering on mw2110 is OK: HTTP OK: HTTP/1.1 200 OK - 75335 bytes in 0.312 second response time [23:37:05] (03PS1) 10Jcrespo: mariadb: Move socket location, disable notifications of es2011 [puppet] - 10https://gerrit.wikimedia.org/r/406130 [23:39:43] (03CR) 10Jcrespo: [C: 032] mariadb: Move socket location, disable notifications of es2011 [puppet] - 10https://gerrit.wikimedia.org/r/406130 (owner: 10Jcrespo) [23:41:56] (03PS1) 10Jcrespo: Revert "mariadb: Depool es2011 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406132 [23:45:47] 10Operations, 10DC-Ops, 10Patch-For-Review: document all scs connections - https://phabricator.wikimedia.org/T175876#3922597 (10RobH) a:05RobH>03ayounsi scs-c1-eqiad.mgmt.eqiad.wmnet is online [23:46:22] 10Operations, 10DC-Ops: document all scs connections - https://phabricator.wikimedia.org/T175876#3922599 (10RobH) [23:51:24] (03PS1) 10Jcrespo: Revert "Revert "mariadb: Reimage es2011 into stretch/MariaDB 10.1"" [puppet] - 10https://gerrit.wikimedia.org/r/406133 [23:51:31] (03PS2) 10Jcrespo: Revert "Revert "mariadb: Reimage es2011 into stretch/MariaDB 10.1"" [puppet] - 
10https://gerrit.wikimedia.org/r/406133 [23:52:34] (03PS1) 10Jcrespo: Revert "mariadb-partman: Modify recipe, to test with es2011" [puppet] - 10https://gerrit.wikimedia.org/r/406134 [23:52:40] (03PS2) 10Jcrespo: Revert "mariadb-partman: Modify recipe, to test with es2011" [puppet] - 10https://gerrit.wikimedia.org/r/406134 [23:53:53] (03CR) 10Jcrespo: [C: 032] Revert "mariadb-partman: Modify recipe, to test with es2011" [puppet] - 10https://gerrit.wikimedia.org/r/406134 (owner: 10Jcrespo) [23:54:19] (03PS3) 10Jcrespo: Revert "Revert "mariadb: Reimage es2011 into stretch/MariaDB 10.1"" [puppet] - 10https://gerrit.wikimedia.org/r/406133 [23:55:10] (03CR) 10Jcrespo: [C: 032] Revert "Revert "mariadb: Reimage es2011 into stretch/MariaDB 10.1"" [puppet] - 10https://gerrit.wikimedia.org/r/406133 (owner: 10Jcrespo) [23:55:40] (03CR) 10Jcrespo: [C: 04-2] "Not until es2011 is back in a production state" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406132 (owner: 10Jcrespo)