[00:00:04] Deploy window No Deploys - US Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170529T0000) [00:00:14] 06Operations, 06Commons, 10media-storage: More missing 'original' files on Commons - https://phabricator.wikimedia.org/T163068#3297325 (10Revent) I fixed two of those by uploading a new copy from youtube (the source), and got the author (Anna Frodesiak) to upload a new copy of another one. [00:00:15] (notes that there) [00:02:33] 06Operations, 10TimedMediaHandler, 10media-storage: Persistent failure of TMH to transcode videos at specific resolutions - https://phabricator.wikimedia.org/T166482#3297326 (10Revent) [00:04:10] I actually had the ‘failed transcodes’ down to around 150 or so, now it’s doubled. [00:04:45] (given how big it was back when I started nagging at people about this, that’s frankly outstanding… [00:05:56] Dereckson: ^ if you are unaware, as of November or so something over 10% of transcodes on Commons were failed [00:07:08] yes, I remember [00:07:53] A ‘lot’ of the ones that are left (before this new influx) are simply massive files… [00:12:21] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:13:11] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy [00:13:21] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:13:21] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:14:11] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy [00:14:11] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy [00:59:36] 06Operations, 06Commons, 10media-storage: More missing 'original' files on Commons - https://phabricator.wikimedia.org/T163068#3297353 (10Revent) Some new examples (with audio files)... https://commons.wikimedia.org/wiki/File:En-uk-autochthon.opus https://commons.wikimedia.org/wiki/File:Zh-yue-%E8%A9%90%E5%... [01:00:14] ^ Dereckson if that helps figure out where the issue is [01:30:31] PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:30:31] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:30:31] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:31:21] RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [01:31:21] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:31:21] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [02:24:25] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 08m 20s) [02:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:51] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=542.00 Read Requests/Sec=464.90 Write Requests/Sec=2.60 KBytes Read/Sec=38018.00 KBytes_Written/Sec=68.80 [04:16:51] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.80 Read Requests/Sec=6.80 Write Requests/Sec=45.50 KBytes Read/Sec=28.00 KBytes_Written/Sec=263.60 [04:38:51] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 
120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [04:39:21] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [05:10:51] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [05:11:31] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [05:54:01] (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355995 (https://phabricator.wikimedia.org/T162807) [05:54:51] (03PS2) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355995 (https://phabricator.wikimedia.org/T166206) [05:54:54] !log Restart MySQL on db1047 - T166452 [05:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:05] T166452: db1047 has been restarted - needs another restart - https://phabricator.wikimedia.org/T166452 [05:56:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355995 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [05:57:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355995 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [05:57:55] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355995 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [05:58:01] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [05:58:11] ^ that is because 
of db1047 being restarted [05:58:51] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [06:01:30] (03PS1) 10Marostegui: db-eqiad.php: Repool db1091, depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355999 (https://phabricator.wikimedia.org/T166206) [06:01:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1091 - T166206 (duration: 03m 01s) [06:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:41] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [06:03:19] (03CR) 10ArielGlenn: "Sounds good to me." [puppet] - 10https://gerrit.wikimedia.org/r/355783 (owner: 10Jcrespo) [06:03:24] (03CR) 10ArielGlenn: [C: 031] mysql-client: Install colordiff (neodymium & sarin) [puppet] - 10https://gerrit.wikimedia.org/r/355783 (owner: 10Jcrespo) [06:04:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1091, depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355999 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [06:07:21] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1091, depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355999 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [06:07:30] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1091, depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355999 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [06:10:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1091, depool db1084 - T166206 (duration: 02m 45s) [06:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:33] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [06:11:02] !log Deploy alter table on s4 db1084 - T166206 [06:11:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:51] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 [06:12:01] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0 [06:27:21] (03PS1) 10Marostegui: db-codfw.php: Repool db2043, depool db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356006 (https://phabricator.wikimedia.org/T166278) [06:28:44] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2043, depool db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356006 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [06:29:55] <_joe_> !log powercycling mw1294 [06:30:02] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2043, depool db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356006 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [06:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:14] (03CR) 10jenkins-bot: db-codfw.php: Repool db2043, depool db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356006 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [06:32:11] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2043, depool db2036 - T166278 (duration: 01m 44s) [06:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:19] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [06:32:29] !log Deploy alter table s3 - db2036 - T166278 [06:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:41] RECOVERY - Host mw1294 is UP: PING OK - Packet loss = 0%, RTA = 36.56 ms [06:33:03] <_joe_> !log trying to restart pdfrender on scb1002 [06:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:41] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:34:51] PROBLEM - Nginx local 
proxy to apache on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:34:51] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [06:35:09] <_joe_> all known ^^ [06:35:31] PROBLEM - nutcracker port on mw1294 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [06:35:31] PROBLEM - nutcracker process on mw1294 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (nutcracker), command name nutcracker [06:35:41] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 80599 bytes in 5.082 second response time [06:35:42] RECOVERY - Nginx local proxy to apache on mw1294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 1.883 second response time [06:36:31] RECOVERY - nutcracker port on mw1294 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [06:36:31] RECOVERY - nutcracker process on mw1294 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [06:38:51] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:39:22] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] [06:41:25] !log Deploy alter table s4 - dbstore1002 - T166206 [06:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:33] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [06:42:52] !log Deploy alter table s3 - dbstore2002 - T166278 [06:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:01] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [06:43:21] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [06:45:17] <_joe_> !log restarting changeprop on scb1002, using 15 gigs of RAM [06:45:25] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:11] 06Operations, 05Goal, 07kubernetes: Prepare to service applications from kubernetes - https://phabricator.wikimedia.org/T162039#3297471 (10Joe) [06:50:13] 06Operations, 05Goal, 13Patch-For-Review, 15User-Joe, 07kubernetes: Upgrade calico to 2.2, document build process. - https://phabricator.wikimedia.org/T165024#3297470 (10Joe) 05Open>03Resolved [06:56:55] 06Operations, 06Performance-Team, 06Services, 07Availability (Multiple-active-datacenters): Consider REST with SSL (HyperSwitch/Cassandra) for session storage - https://phabricator.wikimedia.org/T134811#3297472 (10Joe) I've seen there hasn't been much going on on this task, but I want to have the opportuni... [06:57:11] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 879.06 seconds [06:58:27] ^ because of the alter table [06:58:29] I will silence it [07:01:48] <_joe_> !log re-enabling scap on mw2140, T166328 [07:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:57] T166328: mw2140.codfw.wmnet unreponsive, cannot be powercycled with serial console - https://phabricator.wikimedia.org/T166328 [07:17:41] RECOVERY - mediawiki-installation DSH group on mw2140 is OK: OK [07:27:10] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3297517 (10Marostegui) [07:27:12] 06Operations, 10ops-eqiad, 10DBA: Decommission db1023 - https://phabricator.wikimedia.org/T166486#3297505 (10Marostegui) [07:29:37] (03CR) 10Hashar: [C: 031] Adding media.static.onlinesammlung.thenetexperts.info to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355881 (https://phabricator.wikimedia.org/T166437) (owner: 10Multichill) [07:29:44] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Decommission db1023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356010 
(https://phabricator.wikimedia.org/T166486) [07:33:02] (03PS1) 10Marostegui: mariadb: Decommission db1023 [puppet] - 10https://gerrit.wikimedia.org/r/356011 (https://phabricator.wikimedia.org/T166486) [07:34:45] (03PS1) 10Marostegui: s6.hosts: Decommission db1023 [software] - 10https://gerrit.wikimedia.org/r/356012 (https://phabricator.wikimedia.org/T166486) [07:38:30] !log Stop MySQL on db1095 to take a backup - this will make labsdb1009,10 and 11 break replication while it is down - T153743 [07:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:39] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [07:45:51] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:46:41] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 211 bytes in 0.074 second response time [07:51:07] Revent: I just saw your ping. Anything I can help with ? [07:51:40] 06Operations, 10Electron-PDFs, 06Services, 13Patch-For-Review: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (10Joe) The problem (pdfrender hanging at startup) just showed up again on scb1002, and it seems there is no way to get ar... [07:51:41] Sec (scroll) [07:52:18] akosiaris: https://phabricator.wikimedia.org/T166482 [07:52:52] No idea if it’s something easily repairable, Dereckson told me to open a ticket. 
[07:52:55] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM, merging" [puppet] - 10https://gerrit.wikimedia.org/r/355896 (https://phabricator.wikimedia.org/T166372) (owner: 10Volans) [07:53:00] (03PS2) 10Alexandros Kosiaris: Monitoring: remove spaces from list of interfaces [puppet] - 10https://gerrit.wikimedia.org/r/355896 (https://phabricator.wikimedia.org/T166372) (owner: 10Volans) [07:53:04] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Monitoring: remove spaces from list of interfaces [puppet] - 10https://gerrit.wikimedia.org/r/355896 (https://phabricator.wikimedia.org/T166372) (owner: 10Volans) [07:53:53] akosiaris: OUTSTANDING response time for being listed as ‘on duty’, BTW. :P [07:54:25] (Yes, I know, Sunday on a holiday weekend, I can still give you crap) [07:55:51] :-) [07:55:55] (03CR) 10Jcrespo: [C: 031] db-eqiad,db-codfw.php: Decommission db1023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356010 (https://phabricator.wikimedia.org/T166486) (owner: 10Marostegui) [07:57:11] unknown error ... nice [07:57:23] TBH, I was originally wondering if it was something as simple as a leak of disk space on the scalers, but I have not seen the bot bitching. [07:58:01] PROBLEM - Host scb2006 is DOWN: PING CRITICAL - Packet loss = 100% [07:58:20] akosiaris: Yes, and only on ‘some’ resolutions of ‘some’ files, and persistent through resets of those particular transcodes..... [07:58:35] * akosiaris looking in scb2006 [07:58:38] Good luck.... [07:58:41] RECOVERY - Host scb2006 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [07:58:47] huh ? [07:58:59] rebooted [07:59:29] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Decommission db1023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356010 (https://phabricator.wikimedia.org/T166486) (owner: 10Marostegui) [08:00:22] I have tried resetting the particular transcodes when the scalers were not loaded, they still show the same behavior… run for an appropriate time, then fail and leave that message in the DB. 
[08:00:39] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Decommission db1023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356010 (https://phabricator.wikimedia.org/T166486) (owner: 10Marostegui) [08:00:46] Revent: o/ - do you have any timeline about failed jobs ? It might help in figuring out what happened from the logs [08:00:49] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Decommission db1023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356010 (https://phabricator.wikimedia.org/T166486) (owner: 10Marostegui) [08:00:57] And some of the ones doing it are quite small, that should just run in a couple of minutes. [08:02:03] elukey: It seems to have popped up in the last 3-4 days, I have been offline a lot the last couple of weeks, on and off, so I can’t really say when exactly. [08:02:06] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1023 - T166486 (duration: 00m 42s) [08:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:14] T166486: Decommission db1023 - https://phabricator.wikimedia.org/T166486 [08:02:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1023 - T166486 (duration: 00m 41s) [08:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:53] elukey: looking at the Quarry of ‘failed’, to see if it gives any indication of when it started. [08:04:48] super thanks! [08:05:31] (03CR) 10Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/6552/" [puppet] - 10https://gerrit.wikimedia.org/r/356011 (https://phabricator.wikimedia.org/T166486) (owner: 10Marostegui) [08:06:22] elukey: I had gotten ‘failed’ down to about 150, a lot of which are just stupid big, but it has recently doubled. [08:07:23] There were a lot of old ones that were just due to broken bot transcodes from vids in open source scientific papers. 
[08:08:01] ACKNOWLEDGEMENT - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused Giuseppe Lavagetto T159922 [08:08:46] elukey: https://quarry.wmflabs.org/query/14029 is, for some reason, being stupid slow to load. [08:18:01] 06Operations, 10ops-codfw: mw2140.codfw.wmnet unreponsive, cannot be powercycled with serial console - https://phabricator.wikimedia.org/T166328#3297586 (10jcrespo) a:05jcrespo>03Joe Ok to resolve, @Joe or is there any more debug we want to do? [08:18:19] akosiaris: thanks for reviewing and merging :) [re: r/355896] [08:30:30] elukey: I’m still attempting to get an estimate of when that issue started… that particular query is hell, because the wiki throws incredibly stupidly long data into transcode_error [08:32:05] (03PS4) 10Jcrespo: mysql-client: Install colordiff (neodymium & sarin) [puppet] - 10https://gerrit.wikimedia.org/r/355783 [08:32:24] (03PS1) 10Filippo Giunchedi: decom ms-be2001 - ms-be2012 [puppet] - 10https://gerrit.wikimedia.org/r/356017 (https://phabricator.wikimedia.org/T162785) [08:33:11] Revent: thanks a lot, we can also focus on the timing of two/three jobs.. my point was that with timings in the task it will be quicker to dig into logs and see what happened [08:33:22] Yeah, I understand [08:33:59] (03PS2) 10Marostegui: mariadb: Decommission db1023 [puppet] - 10https://gerrit.wikimedia.org/r/356011 (https://phabricator.wikimedia.org/T166486) [08:34:20] elukey: Look at transcode_error in https://quarry.wmflabs.org/query/18971 tho…. just as an example of why the query of “select * from commonswiki_p.transcode [08:34:20] WHERE transcode_time_startwork IS NOT NULL [08:34:21] AND transcode_time_error IS NOT NULL” is stupid slow. 
[08:35:04] (03CR) 10Marostegui: [C: 032] mariadb: Decommission db1023 [puppet] - 10https://gerrit.wikimedia.org/r/356011 (https://phabricator.wikimedia.org/T166486) (owner: 10Marostegui) [08:35:40] (03CR) 10Marostegui: [C: 032] s6.hosts: Decommission db1023 [software] - 10https://gerrit.wikimedia.org/r/356012 (https://phabricator.wikimedia.org/T166486) (owner: 10Marostegui) [08:36:00] It loads a *lot* of those stupidly long errors... [08:36:10] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Decommission db1023 - https://phabricator.wikimedia.org/T166486#3297610 (10Marostegui) a:03Cmjohnson [08:36:17] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/6553/" [puppet] - 10https://gerrit.wikimedia.org/r/356017 (https://phabricator.wikimedia.org/T162785) (owner: 10Filippo Giunchedi) [08:36:28] (03Merged) 10jenkins-bot: s6.hosts: Decommission db1023 [software] - 10https://gerrit.wikimedia.org/r/356012 (https://phabricator.wikimedia.org/T166486) (owner: 10Marostegui) [08:36:33] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Decommission db1023 - https://phabricator.wikimedia.org/T166486#3297505 (10Marostegui) This is ready for @Cmjohnson to take over whenever he can [08:37:15] Revent: I wasn't logged and I didn't see the output, checking [08:37:27] (03PS5) 10Jcrespo: mysql-client: Install colordiff (neodymium & sarin) [puppet] - 10https://gerrit.wikimedia.org/r/355783 [08:37:29] (03PS1) 10Jcrespo: mariadb: Increase the binlog_cache_size to 10M [puppet] - 10https://gerrit.wikimedia.org/r/356018 [08:37:35] (03PS2) 10Filippo Giunchedi: decom ms-be2001 - ms-be2012 [puppet] - 10https://gerrit.wikimedia.org/r/356017 (https://phabricator.wikimedia.org/T162785) [08:38:12] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/6554/" [puppet] - 10https://gerrit.wikimedia.org/r/355783 (owner: 10Jcrespo) [08:38:18] (03PS6) 10Jcrespo: mysql-client: Install colordiff (neodymium & sarin) [puppet] - 10https://gerrit.wikimedia.org/r/355783 
[08:38:57] (03CR) 10Marostegui: [C: 031] mariadb: Increase the binlog_cache_size to 10M [puppet] - 10https://gerrit.wikimedia.org/r/356018 (owner: 10Jcrespo) [08:39:12] (03CR) 10Filippo Giunchedi: [C: 032] decom ms-be2001 - ms-be2012 [puppet] - 10https://gerrit.wikimedia.org/r/356017 (https://phabricator.wikimedia.org/T162785) (owner: 10Filippo Giunchedi) [08:43:01] 06Operations, 10netops: Filter outgoing BGP announcements on AS regex - https://phabricator.wikimedia.org/T83037#3297617 (10ayounsi) "remove-private" added to all the cr* routers, for the IX/Private/Transit groups. [08:45:04] (03PS7) 10Jcrespo: mysql-client: Install colordiff (neodymium & sarin) [puppet] - 10https://gerrit.wikimedia.org/r/355783 [08:45:43] (03CR) 10Muehlenhoff: [C: 031] mysql-client: Install colordiff (neodymium & sarin) [puppet] - 10https://gerrit.wikimedia.org/r/355783 (owner: 10Jcrespo) [08:46:57] (03PS2) 10Jcrespo: mariadb: Increase the binlog_cache_size to 10M [puppet] - 10https://gerrit.wikimedia.org/r/356018 [08:48:04] (03CR) 10Jcrespo: [C: 032] mariadb: Increase the binlog_cache_size to 10M [puppet] - 10https://gerrit.wikimedia.org/r/356018 (owner: 10Jcrespo) [08:52:18] 06Operations, 13Patch-For-Review: Puppet: test non stringified facts across the fleet - https://phabricator.wikimedia.org/T166372#3297624 (10Volans) After the above was merged now all the `labvirt*` instances have no diff, hence all the differences are just the string vs. integer of the `$::processorcount` as... 
[08:52:50] (03CR) 10Volans: [C: 032] Transports: allow to specify exit codes per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352845 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [08:53:32] (03Merged) 10jenkins-bot: Transports: allow to specify exit codes per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352845 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [08:53:48] (03PS2) 10Volans: ClusterShell: allow to specify exit codes per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352892 (https://phabricator.wikimedia.org/T164833) [08:53:54] 06Operations, 15User-fgiunchedi: Decommission ms-be1001 - ms-be1012 - https://phabricator.wikimedia.org/T166489#3297629 (10fgiunchedi) [08:54:05] 06Operations, 15User-fgiunchedi: Decommission ms-be1001 - ms-be1012 - https://phabricator.wikimedia.org/T166489#3297642 (10fgiunchedi) p:05Triage>03Normal [08:58:02] (03CR) 10Volans: [C: 032] ClusterShell: allow to specify exit codes per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352892 (https://phabricator.wikimedia.org/T164833) (owner: 10Volans) [08:58:32] (03Merged) 10jenkins-bot: ClusterShell: allow to specify exit codes per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352892 (https://phabricator.wikimedia.org/T164833) (owner: 10Volans) [08:58:53] (03PS3) 10Volans: CLI: add -i/--interactive option [software/cumin] - 10https://gerrit.wikimedia.org/r/354442 (https://phabricator.wikimedia.org/T165838) [08:59:25] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3297658 (10fgiunchedi) 05Open>03Resolved All hosts at weight 4000 and in service, decom task for correspondent old hw is T166489 [09:00:42] (03PS2) 10Volans: CLI: add -o/--output to get the output in different formats [software/cumin] - 10https://gerrit.wikimedia.org/r/354637 (https://phabricator.wikimedia.org/T165842) [09:01:05] (03CR) 10Alexandros 
Kosiaris: [C: 04-1] gerrit: dont let sshd listen on all interfaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) [09:01:40] !log Drop gather tables from: testwiki, test2wiki, enwikivoyage, hewiki, enwiki - T166097 [09:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:50] T166097: Drop Gather tables from wmf wikis - https://phabricator.wikimedia.org/T166097 [09:02:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Same comment as for https://gerrit.wikimedia.org/r/#/c/354074/2/modules/gerrit/templates/gerrit.config.erb" [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [09:07:37] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3297675 (10akosiaris) I am thinking this was related to T166203#3294072. TL;DR a daemonized puppet agent was running on tegmen. The timing of the killing of that da... [09:08:28] (03CR) 10Alexandros Kosiaris: [C: 032] Fix spec for various modules [puppet] - 10https://gerrit.wikimedia.org/r/355695 (owner: 10Hashar) [09:08:32] (03PS2) 10Alexandros Kosiaris: Fix spec for various modules [puppet] - 10https://gerrit.wikimedia.org/r/355695 (owner: 10Hashar) [09:08:54] 06Operations, 10hardware-requests, 15User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3297677 (10fgiunchedi) a:05fgiunchedi>03None [09:08:57] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix spec for various modules [puppet] - 10https://gerrit.wikimedia.org/r/355695 (owner: 10Hashar) [09:09:20] elukey: I don’t know if it’s Quarry, or my browser, glictching out, but I have been unable so far to get a Quarry result that I can attempt to sort to see when the error started. 
I’m attempting to add the needed sort to the SQL (and there should be some sane limit on the length of transcode_error) [09:11:41] Yeah, even sorting the query itself by transcode_time_error, I still get a results page that goes blank less than halfway down... [09:12:25] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3225440 (10Volans) @akosiaris actually this happened ~2h after I've killed the daemonized puppet on tegmen... I'm not sure this explanation can still be valid, thou... [09:13:13] Revent: I'll check after lunch I promise, currently finishing some work.. [09:13:24] (nods) [09:13:47] It’s not an ‘end of the world’ kind of thing. [09:14:16] I’m just hoping it’s a stupid fix. :) [09:15:37] (03PS1) 10Muehlenhoff: Record extended account expiry date for piccari [puppet] - 10https://gerrit.wikimedia.org/r/356019 [09:17:51] elukey: BTW, really, transcode_error should truncate after a few KB…. does a huge listing of “frame= 2 fps=0.0 q=0.0 size= 17kB time=00:00:00.62 bitrate= 226.6kbits/s frame= 4 fps=2.4 q=0.0 size= 78kB time=00:00:00.68 bitrate= 940.0kbits/s frame= 7 fps=3.0 q=0.0 size= 158kB time=00:00:00.76 bitrate=1687.1kbits/s “ accomplish anything? [09:17:59] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3297688 (10akosiaris) Hm, I did a mental -2h on Jaime's reported time above to reach UTC and it lined up. For some reason I still do it when I see no TZ information... 
[09:19:05] (03PS1) 10Marostegui: db-eqiad.php: Repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356020 [09:20:32] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356020 (owner: 10Marostegui) [09:21:20] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3297692 (10jcrespo) I reported on UTC times, I always do. [09:21:33] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356020 (owner: 10Marostegui) [09:21:42] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356020 (owner: 10Marostegui) [09:22:48] (03CR) 10Muehlenhoff: [C: 032] Record extended account expiry date for piccari [puppet] - 10https://gerrit.wikimedia.org/r/356019 (owner: 10Muehlenhoff) [09:22:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1045 (duration: 00m 41s) [09:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:49] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3297714 (10Volans) Yes @akosiaris , all the times it happened was during a cron puppet run and seems to me only when there are changes in the `puppet_hosts.cfg` gen... 
[09:29:36] (03PS1) 10Alexandros Kosiaris: Timestamp puppet-run logs [puppet] - 10https://gerrit.wikimedia.org/r/356021 (https://phabricator.wikimedia.org/T164206) [09:29:38] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review, 06Release-Engineering-Team (Kanban), and 2 others: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#3297715 (10zeljkofilipin) [09:29:54] (03PS3) 10Filippo Giunchedi: phabricator: redirect serveraliases homepage to phab_servername [puppet] - 10https://gerrit.wikimedia.org/r/355769 (https://phabricator.wikimedia.org/T166120) [09:30:17] (03CR) 10Filippo Giunchedi: "> I've tested this. I've found that this has to be way up near the" [puppet] - 10https://gerrit.wikimedia.org/r/355769 (https://phabricator.wikimedia.org/T166120) (owner: 10Filippo Giunchedi) [09:31:25] 06Operations, 10Icinga, 10Monitoring, 13Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3297720 (10akosiaris) Thanks for setting an example on this one. So the daemonized puppet agent theory is ruled out :-(. Looking at the log... [09:31:47] (03PS7) 10Filippo Giunchedi: prometheus: report puppet agent stats [puppet] - 10https://gerrit.wikimedia.org/r/354007 [09:33:32] 06Operations, 10Icinga, 10Monitoring, 13Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3297721 (10akosiaris) >>! In T164206#3297714, @Volans wrote: > Yes @akosiaris , all the times it happened was during a cron puppet run and see... [09:40:58] 06Operations, 10Icinga, 10Monitoring, 13Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3297736 (10jcrespo) >>! In T164206#3297720, @akosiaris wrote: > While a tad difficult to figure out the correct run due to the log not being t... 
[09:53:03] (CR) Filippo Giunchedi: [C: 2] prometheus: report puppet agent stats [puppet] - https://gerrit.wikimedia.org/r/354007 (owner: Filippo Giunchedi)
[09:53:12] !log upgrade remaining mw* hosts already running HHVM 3.18 to 3.18.2+dfsg-1+wmf4
[09:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:21] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/6555/" [puppet] - https://gerrit.wikimedia.org/r/354457 (owner: Filippo Giunchedi)
[10:05:57] (CR) Filippo Giunchedi: Setup apache vhost on scap proxies as well (2 comments) [puppet] - https://gerrit.wikimedia.org/r/344221 (owner: Chad)
[10:06:15] Operations, Traffic: Can't upload large files with X-Wikimedia-Debug turned on - https://phabricator.wikimedia.org/T165324#3297810 (ema) p:Triage>Normal
[10:06:42] godog: by any chance did you test 354457 in labs or somewhere? just to be sure before adding it to the fleet
[10:06:57] (I have a puppetmaster in labs if needed ;) )
[10:08:01] (PS1) Ema: debug_proxy: increase client_max_body_size [puppet] - https://gerrit.wikimedia.org/r/356022 (https://phabricator.wikimedia.org/T165324)
[10:08:32] volans: yeah it is cherry-picked in beta
[10:08:56] great, then good to go for me
[10:08:57] let me rebase it there again
[10:16:22] indeed there's a mistake now, 'prometheus' user isn't part of 'prometheus-node-exporter' group
[10:18:11] nice :)
[10:18:34] !log upgrade nginx to 1.11.10-1+wmf1 on hassium and hassaleh
[10:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:21] and 'prometheus' is a real user too in labs
[10:22:32] I'll postpone it, I can see the rabbit hole from here
[10:30:23] (PS6) Filippo Giunchedi: base: report prometheus agent stats [puppet] - https://gerrit.wikimedia.org/r/354457
[10:30:27] (PS6) Filippo Giunchedi: prometheus: add alertmanager_url to prometheus server [puppet] -
https://gerrit.wikimedia.org/r/354459
[10:30:29] (PS6) Filippo Giunchedi: role: use alertmanager in beta prometheus [puppet] - https://gerrit.wikimedia.org/r/354460
[10:30:31] (PS5) Filippo Giunchedi: role: set external url for prometheus beta [puppet] - https://gerrit.wikimedia.org/r/354975
[10:30:33] (PS6) Filippo Giunchedi: WIP prometheus::alertmanager [puppet] - https://gerrit.wikimedia.org/r/354976
[10:30:37] (PS1) Filippo Giunchedi: prometheus: allow user 'prometheus' to export metrics too [puppet] - https://gerrit.wikimedia.org/r/356023
[10:32:23] (CR) jerkins-bot: [V: -1] WIP prometheus::alertmanager [puppet] - https://gerrit.wikimedia.org/r/354976 (owner: Filippo Giunchedi)
[10:33:13] (CR) Ema: [V: 2 C: 2] debug_proxy: increase client_max_body_size [puppet] - https://gerrit.wikimedia.org/r/356022 (https://phabricator.wikimedia.org/T165324) (owner: Ema)
[10:34:05] (PS4) Elukey: [WIP] Add the eventlogging_cleaner script and base package [software/analytics-eventlogging-maintenance] - https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933)
[10:35:56] Operations, Traffic, Patch-For-Review: Can't upload large files with X-Wikimedia-Debug turned on - https://phabricator.wikimedia.org/T165324#3263087 (ema) @Gilles the problem should be fixed now, please let me know if that's not the case.
[10:39:55] !log installing fop security updates
[10:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:26] (PS2) Filippo Giunchedi: prometheus: allow user 'prometheus' to export metrics too [puppet] - https://gerrit.wikimedia.org/r/356023
[10:40:28] (PS7) Filippo Giunchedi: base: report prometheus agent stats [puppet] - https://gerrit.wikimedia.org/r/354457
[10:40:30] (PS7) Filippo Giunchedi: prometheus: add alertmanager_url to prometheus server [puppet] - https://gerrit.wikimedia.org/r/354459
[10:40:32] (PS7) Filippo Giunchedi: role: use alertmanager in beta prometheus [puppet] - https://gerrit.wikimedia.org/r/354460
[10:40:34] (PS6) Filippo Giunchedi: role: set external url for prometheus beta [puppet] - https://gerrit.wikimedia.org/r/354975
[10:40:36] (PS7) Filippo Giunchedi: WIP prometheus::alertmanager [puppet] - https://gerrit.wikimedia.org/r/354976
[10:41:35] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 10 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm],Package[hhvm-dbg]
[10:42:31] (CR) jerkins-bot: [V: -1] WIP prometheus::alertmanager [puppet] - https://gerrit.wikimedia.org/r/354976 (owner: Filippo Giunchedi)
[11:02:35] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[11:13:48] Operations, ops-eqiad, Labs: Degraded RAID on labstore1003 - https://phabricator.wikimedia.org/T165220#3260322 (Volans) Should this be resolved? There is still a disk with predictive failure, but not yet failed: ``` $ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does no...
[11:14:36] Operations, ops-eqiad, Analytics-Kanban, DBA: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3298035 (Volans)
[11:16:15] Operations, ops-eqiad, Analytics-Kanban, DBA: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3298040 (Marostegui) @elukey @Ottomata maybe this can be done along with: T166141
[11:17:45] Operations, ops-eqiad, Analytics-Kanban, DBA: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3298063 (elukey) +1
[11:18:05] Operations, ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3298064 (Volans) Open>Resolved All looks good, resolving for now: ``` $ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not include components in optimal state) === RaidStatus...
[11:22:15] (PS5) Elukey: [WIP] Add the eventlogging_cleaner script and base package [software/analytics-eventlogging-maintenance] - https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933)
[11:23:05] PROBLEM - Check size of conntrack table on ms-fe1005 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
[11:27:05] RECOVERY - Check size of conntrack table on ms-fe1005 is OK: OK: nf_conntrack is 77 % full
[11:27:30] ^ ms-fe1005 was also hit by https://phabricator.wikimedia.org/T136094, set net.netfilter.nf_conntrack_tcp_timeout_time_wait to 65
[11:31:46] Operations, Traffic, Patch-For-Review: Can't upload large files with X-Wikimedia-Debug turned on - https://phabricator.wikimedia.org/T165324#3298137 (Gilles) Open>Resolved a:Gilles
[11:39:40] (PS1) Muehlenhoff: Remove obsolete mediawiki::migrate class [puppet] - https://gerrit.wikimedia.org/r/356026
[11:41:09] (CR) Faidon Liambotis: [C: -2] "This won't work and I don't think you understand what this does..."
[puppet] - https://gerrit.wikimedia.org/r/355717 (https://phabricator.wikimedia.org/T166322) (owner: Paladox)
[11:42:38] (CR) Paladox: "> This won't work and I don't think you understand what this does..." [puppet] - https://gerrit.wikimedia.org/r/355717 (https://phabricator.wikimedia.org/T166322) (owner: Paladox)
[11:44:19] Operations, ops-esams, Traffic: cp3003 network interface issues - https://phabricator.wikimedia.org/T162132#3153465 (faidon) Note that due to various other changes in the infrastructure since this went offline, we'll need to reinstall the system when (if?) it comes back up online.
[11:46:13] (PS1) Marostegui: db-eqiad.php: Depool db1070 [mediawiki-config] - https://gerrit.wikimedia.org/r/356027 (https://phabricator.wikimedia.org/T153743)
[11:49:07] h
[11:51:17] Operations: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203#3298177 (faidon) Resolved>Open Since we're planning to gradually introduce features (e.g. structured facts) that are Facter >= 2 specific, we should probably do the same upgrade on Labs hosts as well. Since this h...
[11:51:40] Operations, Labs: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203#3298179 (faidon) p:Triage>Normal
[12:19:01] (PS9) Hashar: contint: New role for Docker based CI slave [puppet] - https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) (owner: Dduvall)
[12:19:38] (PS1) Alexandros Kosiaris: Kubernetes: Add IPv6 mapped addresses [puppet] - https://gerrit.wikimedia.org/r/356029
[12:20:10] (PS1) Faidon Liambotis: raid: switch from stringified fact to array [puppet] - https://gerrit.wikimedia.org/r/356030 (https://phabricator.wikimedia.org/T166372)
[12:20:12] (PS1) Faidon Liambotis: Remove str2bool from is_virtual facts [puppet] - https://gerrit.wikimedia.org/r/356031 (https://phabricator.wikimedia.org/T166372)
[12:20:14] (PS1) Faidon Liambotis: Remove to_i/Integer from now unstringified facts [puppet] - https://gerrit.wikimedia.org/r/356032 (https://phabricator.wikimedia.org/T166372)
[12:21:21] (PS3) Faidon Liambotis: Rewrite the LLDP fact(s) [puppet] - https://gerrit.wikimedia.org/r/354084
[12:21:23] (PS2) Faidon Liambotis: Do not confine LLDP fact to physical/non-VMs [puppet] - https://gerrit.wikimedia.org/r/354108
[12:23:18] (CR) Alexandros Kosiaris: [C: 2] Kubernetes: Add IPv6 mapped addresses [puppet] - https://gerrit.wikimedia.org/r/356029 (owner: Alexandros Kosiaris)
[12:23:36] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 17 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm],Package[hhvm-dbg]
[12:23:36] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 17 minutes ago with 2 failures.
Failed resources (up to 3 shown): Service[hhvm],Package[hhvm-dbg]
[12:24:33] (CR) Hashar: [C: 1] contint: New role for Docker based CI slave [puppet] - https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) (owner: Dduvall)
[12:25:20] (CR) Hashar: [C: 1] "cherry picked on CI and looks fine." [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox)
[12:26:24] (CR) Hashar: [V: 1 C: 2] Test wgLogoHD keys [mediawiki-config] - https://gerrit.wikimedia.org/r/344798 (https://phabricator.wikimedia.org/T161416) (owner: Dereckson)
[12:26:35] Operations, TimedMediaHandler, media-storage: Persistent failure of TMH to transcode videos at specific resolutions - https://phabricator.wikimedia.org/T166482#3297326 (akosiaris) Looking at the logs, I find a 413 error in swift. ``` FileOperation.log-20170529:2017-05-28 15:00:47 [WSqfmApAEDMAAdnL0D...
[12:26:40] (CR) Hashar: [C: 1] test: factor out wgConf loading [mediawiki-config] - https://gerrit.wikimedia.org/r/355440 (owner: Hashar)
[12:26:45] Operations, TimedMediaHandler, media-storage: Persistent failure of TMH to transcode videos at specific resolutions - https://phabricator.wikimedia.org/T166482#3298226 (akosiaris) p:Triage>High
[12:29:11] (PS1) Gehel: elasticsearch - introduce curator configs to enable / disable replication [puppet] - https://gerrit.wikimedia.org/r/356033 (https://phabricator.wikimedia.org/T166154)
[12:29:57] (CR) Alexandros Kosiaris: [C: 1] "I am going to tentatively merge this, feel free to adjust in future patches" [dns] - https://gerrit.wikimedia.org/r/341794 (https://phabricator.wikimedia.org/T165732) (owner: Alexandros Kosiaris)
[12:30:01] (PS4) Alexandros Kosiaris: Assign the kubernetes pod IPs in DNS [dns] - https://gerrit.wikimedia.org/r/341794 (https://phabricator.wikimedia.org/T165732)
[12:30:06] (CR) Alexandros Kosiaris: [V: 2 C:
2] Assign the kubernetes pod IPs in DNS [dns] - https://gerrit.wikimedia.org/r/341794 (https://phabricator.wikimedia.org/T165732) (owner: Alexandros Kosiaris)
[12:30:35] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[12:31:22] !log update kubernetes policy-options on cr{1,2}-{eqiad,codfw}. T165732
[12:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:31] T165732: Assigning IP space for kubernetes pod IPs - https://phabricator.wikimedia.org/T165732
[12:31:35] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[12:32:19] Operations, Goal, Patch-For-Review, kubernetes: Assigning IP space for kubernetes IPs - https://phabricator.wikimedia.org/T165732#3298233 (akosiaris)
[12:33:16] Operations, Goal, Patch-For-Review, kubernetes: Assigning IP space for kubernetes IPs - https://phabricator.wikimedia.org/T165732#3275340 (akosiaris) In the interest of having this documented and not use some `192.168/16` IP space I've assigned IP spaces for the service IPs ranges as well
[12:36:13] !log installing imagemagick security updates on jessie
[12:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:50] Operations, MediaWiki-General-or-Unknown, Performance-Team, Wikimedia-General-or-Unknown, Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3298245 (Gilles) There are no xhprof runs in xhgui for the initial deployment window whe...
[12:41:13] !log disable puppet across codfw/ulsfo for puppetmaster upgrade.
This should avoid any irc spam about failed puppet agent runs
[12:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:49] (CR) DCausse: [C: 1] elasticsearch - introduce curator configs to enable / disable replication [puppet] - https://gerrit.wikimedia.org/r/356033 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[12:41:51] (PS6) Paladox: Jenkins: Add noncanon to jenkins proxy site [puppet] - https://gerrit.wikimedia.org/r/351391
[12:47:19] Operations, Performance-Team, Thumbor, Patch-For-Review, User-fgiunchedi: Implement DC-local cache failure limiter in Thumbor - https://phabricator.wikimedia.org/T151065#3298251 (Gilles) Open>Resolved
[12:49:41] (PS6) Elukey: [WIP] Add the eventlogging_cleaner script and base package [software/analytics-eventlogging-maintenance] - https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933)
[12:52:02] !log enable puppet across codfw/ulsfo after puppetmaster upgrade
[12:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:54] (PS2) Gehel: elasticsearch - introduce curator configs to enable / disable replication [puppet] - https://gerrit.wikimedia.org/r/356033 (https://phabricator.wikimedia.org/T166154)
[12:52:57] !log disable puppet across eqiad/esams for puppetmaster upgrade. This should avoid any irc spam about failed puppet agent runs
[12:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:11] * akosiaris loves cumin
[12:53:14] volans: ^
[12:53:23] akosiaris: lol, thanks :)
[12:53:32] I just noticed you tricked it, kudos
[12:53:49] I tricked it?
[12:54:10] using the 'R:class%site = ulsfo' selector without the 'R:class = profile::cumin::target' part ;)
[12:54:18] ahahaha
[12:54:28] I shouldn't have?
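For reference, the selector being discussed can be sketched as a small query builder. This is a minimal illustration, assuming the query shape quoted in this conversation for 'R:class%site = ulsfo'; `class_param_query` is a hypothetical helper and not Cumin's code, and the title clause is only one assumed way to narrow the match to a single class.

```python
import json


def class_param_query(param, value, class_title=None):
    """Build a PuppetDB-style resource query for classes with a parameter.

    Hypothetical helper for illustration only; not Cumin's implementation.
    """
    clauses = [["=", "type", "Class"]]
    if class_title is not None:
        # Assumption: restricting by title limits the match to one class.
        clauses.append(["=", "title", class_title])
    clauses.append(["=", ["parameter", param], value])
    return ["and"] + clauses


# Short form: matches any class that has a "site" parameter equal to "ulsfo".
short = class_param_query("site", "ulsfo")
# Narrowed form: only the named class (the title value is an assumption).
narrowed = class_param_query("site", "ulsfo", class_title="Profile::Cumin::Target")

print(json.dumps(short))
print(json.dumps(narrowed))
```

Without the title clause, any class that happens to carry a matching "site" parameter is selected, which is why the short form is only safe if no unrelated class uses that parameter.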
[12:54:36] it did work out ok ;-)
[12:54:43] I guess it might select additional hosts that have any class with a parameter "site" with value "ulsfo"
[12:54:51] (CR) Gehel: [C: 2] elasticsearch - introduce curator configs to enable / disable replication [puppet] - https://gerrit.wikimedia.org/r/356033 (https://phabricator.wikimedia.org/T166154) (owner: Gehel)
[12:55:05] given that the generated query is: ["and", ["=", "type", "Class"], ["=", ["parameter", "site"], "ulsfo"]]
[12:55:08] oh we should not have that
[12:55:18] I've chased that pattern down 1.5 years ago
[12:56:19] nice shortcut then, if it's "safe" :D I think for now I'll leave the longer version on wikitech
[12:58:11] akosiaris: (remotely related) - I am trying to convert role::zookeeper to profiles. I am using in there ${::hostname}_${::site} to create a monitoring metric dynamically, should I stop using those?
[12:58:24] or is it still allowed in profiles?
[12:58:45] (CR) Mforns: [WIP] Add the eventlogging_cleaner script and base package (13 comments) [software/analytics-eventlogging-maintenance] - https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933) (owner: Elukey)
[12:59:10] elukey: I think it's fine
[12:59:26] I see no reason to stop that behavior in a profile module. On the contrary
[13:01:01] thanks :)
[13:01:13] should be able to send a code review soon then
[13:02:26] !log enable puppet across eqiad/esams after puppetmaster upgrade.
[13:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:55] PROBLEM - puppet last run on mw2149 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/share/GeoIP]
[13:04:42] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1070 [mediawiki-config] - https://gerrit.wikimedia.org/r/356027 (https://phabricator.wikimedia.org/T153743) (owner: Marostegui)
[13:05:55] RECOVERY - puppet last run on mw2149 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[13:06:14] (Merged) jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - https://gerrit.wikimedia.org/r/356027 (https://phabricator.wikimedia.org/T153743) (owner: Marostegui)
[13:06:24] (CR) jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - https://gerrit.wikimedia.org/r/356027 (https://phabricator.wikimedia.org/T153743) (owner: Marostegui)
[13:07:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1070 - T153743 (duration: 00m 41s)
[13:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:54] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[13:10:22] !log Stop replication on db1070 to flush tables for export - T153743
[13:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:31] (PS1) Faidon Liambotis: network: add router loopbacks subnets [puppet] - https://gerrit.wikimedia.org/r/356035
[13:12:54] (PS1) Faidon Liambotis: pmacct: fix firewall rules [puppet] - https://gerrit.wikimedia.org/r/356036
[13:13:02] akosiaris: ^
[13:13:14] akosiaris: these are the two versions, pick one :P
[13:14:55] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[imagemagick]
[13:16:38] (PS7) Elukey: [WIP] Add the eventlogging_cleaner script and base package [software/analytics-eventlogging-maintenance] - https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933)
[13:25:25] (PS1) Filippo Giunchedi: Don't symlink systemd service instances [puppet] - https://gerrit.wikimedia.org/r/356038 (https://phabricator.wikimedia.org/T166389)
[13:26:25] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 0.126 second response time
[13:27:50] (PS1) Alexandros Kosiaris: Utilize the allocated service ips in kubernetes [puppet] - https://gerrit.wikimedia.org/r/356039 (https://phabricator.wikimedia.org/T165732)
[13:33:50] (PS1) Faidon Liambotis: pmacct: update config, push output to Kafka [puppet] - https://gerrit.wikimedia.org/r/356040
[13:39:46] paravoid: awesome --^
[13:42:07] !log updating gdb on mw* servers
[13:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:55] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[13:44:15] (CR) Ema: "Any reasons for not merging this into master instead?" [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355876 (owner: Giuseppe Lavagetto)
[13:49:02] (CR) Alexandros Kosiaris: [C: 1] "Aside from where the data is actually stored, this actually is service-wise a preferred solution as the end ferm rule is as strict as poss" [puppet] - https://gerrit.wikimedia.org/r/356036 (owner: Faidon Liambotis)
[13:51:30] (CR) Alexandros Kosiaris: "I've +1ed https://gerrit.wikimedia.org/r/#/c/356036/1 but I am not sure this is without its merits, at least as documentation.
FWIW $DOMA" [puppet] - https://gerrit.wikimedia.org/r/356035 (owner: Faidon Liambotis)
[13:51:55] (PS2) Alexandros Kosiaris: Utilize the allocated service ips in kubernetes [puppet] - https://gerrit.wikimedia.org/r/356039 (https://phabricator.wikimedia.org/T165732)
[13:52:18] (CR) Alexandros Kosiaris: [V: 2 C: 2] Utilize the allocated service ips in kubernetes [puppet] - https://gerrit.wikimedia.org/r/356039 (https://phabricator.wikimedia.org/T165732) (owner: Alexandros Kosiaris)
[13:56:06] (PS1) Elukey: Add zookeeper.yaml to hieradata common [labs/private] - https://gerrit.wikimedia.org/r/356042
[13:57:05] PROBLEM - Check systemd state on argon is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:57:12] me ^
[13:57:55] RECOVERY - Check systemd state on argon is OK: OK - running: The system is fully operational
[13:59:15] (CR) Elukey: [V: 2 C: 2] Add zookeeper.yaml to hieradata common [labs/private] - https://gerrit.wikimedia.org/r/356042 (owner: Elukey)
[13:59:35] (PS1) Marostegui: db-codfw.php: Repool db2036 [mediawiki-config] - https://gerrit.wikimedia.org/r/356044 (https://phabricator.wikimedia.org/T166278)
[14:01:25] !log Deploy alter table s3 on codfw master db2018 - T166278
[14:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:33] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278
[14:01:39] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2036 [mediawiki-config] - https://gerrit.wikimedia.org/r/356044 (https://phabricator.wikimedia.org/T166278) (owner: Marostegui)
[14:02:37] (Merged) jenkins-bot: db-codfw.php: Repool db2036 [mediawiki-config] - https://gerrit.wikimedia.org/r/356044 (https://phabricator.wikimedia.org/T166278) (owner: Marostegui)
[14:02:50] (CR) jenkins-bot: db-codfw.php: Repool db2036 [mediawiki-config] - https://gerrit.wikimedia.org/r/356044
(https://phabricator.wikimedia.org/T166278) (owner: Marostegui)
[14:03:42] (CR) Giuseppe Lavagetto: "> Any reasons for not merging this into master instead?" [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355876 (owner: Giuseppe Lavagetto)
[14:03:51] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2036 - T166278 (duration: 00m 41s)
[14:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:25] !log starting upgrade to elasticsearch 5.3.2 on cirrus codfw cluster - T163708
[14:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:33] T163708: Upgrade the production search cluster to elastic 5.3.2 - https://phabricator.wikimedia.org/T163708
[14:05:00] Operations, Analytics, EventBus, hardware-requests, and 2 others: New SCB nodes - https://phabricator.wikimedia.org/T166342#3298338 (Ottomata) Robh, let's aim for +3 scb nodes in each DC. So +6 nodes total.
[14:12:54] Operations, TimedMediaHandler, media-storage: Persistent failure of TMH to transcode videos at specific resolutions - https://phabricator.wikimedia.org/T166482#3298344 (fgiunchedi) Indeed it looks like the upload exceeds the maximum swift file size (5GB by default). Though `wgMaxUploadSize` is 4GB no...
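The size mismatch described in the last comment can be illustrated with a small check. This is a rough sketch, assuming binary units (GiB) for the "5GB" Swift limit and the "4GB" `wgMaxUploadSize` mentioned there; the constant names, the helper, and the 1.3x growth factor for a transcode are made up for illustration.

```python
# Assumption: both limits are taken as GiB; the real configured values may differ.
SWIFT_MAX_OBJECT_BYTES = 5 * 1024**3
MAX_UPLOAD_BYTES = 4 * 1024**3


def fits_in_single_object(size_bytes, limit=SWIFT_MAX_OBJECT_BYTES):
    """Return True if an object of this size stays within the storage limit."""
    return size_bytes <= limit


# An original at the upload limit is accepted by the storage backend...
original = MAX_UPLOAD_BYTES
# ...but a derived file that grows by a (hypothetical) 30% no longer fits.
transcode = int(original * 1.3)

print(fits_in_single_object(original))
print(fits_in_single_object(transcode))
```

The point is only that an upload limit below the storage limit does not bound the size of files derived from the upload.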
[14:14:32] !log Stop MySQL labsdb1009 to take a backup - T153743
[14:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:40] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[14:15:23] !log reset remote for elasticsearch/plugins deployment - T163708
[14:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:31] T163708: Upgrade the production search cluster to elastic 5.3.2 - https://phabricator.wikimedia.org/T163708
[14:18:55] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[14:19:51] ^ that is labsdb1009 being down
[14:19:54] for maintenance
[14:20:56] (Abandoned) Faidon Liambotis: Phabricator: Install the lighter version of exim4 on labs only [puppet] - https://gerrit.wikimedia.org/r/355717 (https://phabricator.wikimedia.org/T166322) (owner: Paladox)
[14:22:39] (PS1) Alexandros Kosiaris: Specify the correct service IPs for kubernetes clusters [puppet] - https://gerrit.wikimedia.org/r/356050
[14:23:55] (PS2) Alexandros Kosiaris: Specify the correct service IPs for kubernetes clusters [puppet] - https://gerrit.wikimedia.org/r/356050 (https://phabricator.wikimedia.org/T165732)
[14:24:02] !log upgrade prometheus-hhvm-exporter to 0.3-1 in codfw/eqiad with less verbose logging - T158286
[14:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:11] T158286: Raise default logging level of prometheus-hhvm-exporter - https://phabricator.wikimedia.org/T158286
[14:26:04] (CR) Alexandros Kosiaris: [C: 2] Specify the correct service IPs for kubernetes clusters [puppet] - https://gerrit.wikimedia.org/r/356050 (https://phabricator.wikimedia.org/T165732) (owner: Alexandros Kosiaris)
[14:29:48] Operations, Analytics, EventBus, hardware-requests, and 2 others: New SCB nodes -
https://phabricator.wikimedia.org/T166342#3298386 (mobrovac) FYI, this expansion will also come in handy for the new service being developed by the Research team - the #recommendation-api service.
[14:29:56] Operations, Analytics, Analytics-Cluster, hardware-requests: eqiad: (3)+ nodes for Druid / analytics - https://phabricator.wikimedia.org/T166510#3298388 (Ottomata)
[14:30:11] Operations, Analytics, Analytics-Cluster, hardware-requests: eqiad: hadoop expansion part deux - https://phabricator.wikimedia.org/T166509#3298403 (Ottomata)
[14:32:02] (CR) Ema: Add netlink-based Ipvsmanager implementation (1 comment) [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355082 (owner: Giuseppe Lavagetto)
[14:33:08] Operations, Analytics, EventBus, hardware-requests, and 2 others: New SCB nodes - https://phabricator.wikimedia.org/T166342#3298405 (Ottomata) p:Triage>High
[14:33:14] Operations, Analytics, Analytics-Cluster, hardware-requests: eqiad: hadoop expansion part deux - https://phabricator.wikimedia.org/T166509#3298406 (Ottomata) p:Triage>High
[14:33:14] !log rebooting multatuli for systemd modules-load.d debugging
[14:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:26] Operations, Analytics, EventBus, hardware-requests, and 2 others: New SCB nodes - https://phabricator.wikimedia.org/T166342#3293197 (Ottomata) p:High>Normal
[14:33:36] Operations, Analytics, Analytics-Cluster, hardware-requests: eqiad: (3)+ nodes for Druid / analytics - https://phabricator.wikimedia.org/T166510#3298408 (Ottomata) p:Triage>High
[14:34:02] (PS1) Gehel: elasticsearch - correct naming of curator config files [puppet] - https://gerrit.wikimedia.org/r/356052 (https://phabricator.wikimedia.org/T166154)
[14:40:21] (CR) Ema: Add netlink-based Ipvsmanager implementation (1 comment) [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355082 (owner:
Giuseppe Lavagetto)
[14:40:47] (PS1) Giuseppe Lavagetto: role::scb: add nutcracker for changeprop [puppet] - https://gerrit.wikimedia.org/r/356053
[14:46:12] (CR) Ema: Add netlink-based Ipvsmanager implementation (1 comment) [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355082 (owner: Giuseppe Lavagetto)
[14:47:59] (CR) Ema: Add netlink-based Ipvsmanager implementation (1 comment) [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355082 (owner: Giuseppe Lavagetto)
[14:50:16] (CR) Giuseppe Lavagetto: [C: 2] role::scb: add nutcracker for changeprop [puppet] - https://gerrit.wikimedia.org/r/356053 (owner: Giuseppe Lavagetto)
[14:50:47] (PS2) Giuseppe Lavagetto: role::scb: add nutcracker for changeprop [puppet] - https://gerrit.wikimedia.org/r/356053
[14:50:56] (CR) Giuseppe Lavagetto: [V: 2 C: 2] role::scb: add nutcracker for changeprop [puppet] - https://gerrit.wikimedia.org/r/356053 (owner: Giuseppe Lavagetto)
[14:53:08] Operations, TimedMediaHandler, media-storage: Persistent failure of TMH to transcode videos at specific resolutions - https://phabricator.wikimedia.org/T166482#3297326 (TheDJ) a transcoded version needs to use 'generic' transcoding settings, so might easily be larger than an optimised original indeed.
[14:53:15] Operations, Performance-Team, Services, Availability (Multiple-active-datacenters): Consider REST with SSL (HyperSwitch/Cassandra) for session storage - https://phabricator.wikimedia.org/T134811#3298450 (mobrovac) >>! In T134811#3297472, @Joe wrote: > I've seen there hasn't been much going on on...
[14:55:30] Operations, Performance-Team, Services, Availability (Multiple-active-datacenters): Consider REST with SSL (HyperSwitch/Cassandra) for session storage - https://phabricator.wikimedia.org/T134811#3298453 (Joe) >>! In T134811#3298450, @mobrovac wrote: >>>! In T134811#3297472, @Joe wrote: >> I've se...
[14:55:33] (CR) Muehlenhoff: [C: 1] "Looks fine, carbon-cache@, statsite@ and thumbor@ are provided via base::service_unit already, providing the correct fallbacks for the ins" [puppet] - https://gerrit.wikimedia.org/r/356038 (https://phabricator.wikimedia.org/T166389) (owner: Filippo Giunchedi)
[14:57:45] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nutcracker]
[14:57:58] (PS1) Giuseppe Lavagetto: role::scb: fix redis host list for changeprop [puppet] - https://gerrit.wikimedia.org/r/356054
[14:58:35] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nutcracker]
[14:58:41] (CR) Giuseppe Lavagetto: [V: 2 C: 2] role::scb: fix redis host list for changeprop [puppet] - https://gerrit.wikimedia.org/r/356054 (owner: Giuseppe Lavagetto)
[14:59:55] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0]
[15:00:55] PROBLEM - puppet last run on mw2224 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[imagemagick]
[15:02:35] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Service[nutcracker]
[15:02:55] !log restarting wdqs-updater on wdqs1002
[15:02:55] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 37.93% of data above the critical threshold [1800.0]
[15:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:25] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[15:04:35] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[15:04:45] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[15:12:19] (PS8) Elukey: [WIP] Add the eventlogging_cleaner script and base package [software/analytics-eventlogging-maintenance] - https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933)
[15:17:05] Operations, Patch-For-Review: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094#3298506 (MoritzMuehlenhoff) The modules-load.d approach mentioned in sysctl.d isn't sufficiently race-free: While systemd-sysctl.service has a "After: syste...
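One way to spot a host affected by the race discussed in T136094 is to compare the live sysctl value with the intended one. The following is a minimal sketch, assuming the usual Linux mapping from a sysctl key to a file under /proc/sys and the value 65 mentioned earlier in this log; `sysctl_path` and `check_sysctl` are hypothetical helper names.

```python
from pathlib import Path


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file path (dots become directories)."""
    return Path(root) / key.replace(".", "/")


def check_sysctl(key, expected, root="/proc/sys"):
    """Return 'ok', 'missing' (e.g. module not loaded), or 'mismatch (<value>)'."""
    path = sysctl_path(key, root)
    if not path.exists():
        # The key only exists once the owning kernel module has been loaded.
        return "missing"
    actual = path.read_text().strip()
    return "ok" if actual == str(expected) else f"mismatch ({actual})"


KEY = "net.netfilter.nf_conntrack_tcp_timeout_time_wait"
print(sysctl_path(KEY))
print(check_sysctl(KEY, 65))
```

A "missing" or "mismatch" result after boot would be consistent with the setting having been applied before the module created the key, though other causes are possible.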
[15:26:30] PROBLEM - nutcracker port on scb1004 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:27:00] PROBLEM - nutcracker port on scb2005 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:27:10] PROBLEM - nutcracker port on scb2003 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:27:17] <_joe_> heh it's all so fucked up ^^ [15:27:27] <_joe_> (that's part of what I was referring to) [15:27:30] PROBLEM - nutcracker port on scb1001 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:27:50] PROBLEM - nutcracker port on scb2006 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:28:05] <_joe_> I'll ack them all in a second [15:28:10] PROBLEM - nutcracker port on scb2001 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:28:10] PROBLEM - nutcracker port on scb1002 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:28:50] PROBLEM - nutcracker port on scb2004 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:29:00] RECOVERY - puppet last run on mw2224 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:29:20] PROBLEM - nutcracker port on scb2002 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:29:20] PROBLEM - nutcracker port on scb1003 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:36:35] ACKNOWLEDGEMENT - nutcracker port on scb1001 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Giuseppe Lavagetto Working on it! [15:36:36] ACKNOWLEDGEMENT - nutcracker port on scb1002 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Giuseppe Lavagetto Working on it! 
[15:36:36] ACKNOWLEDGEMENT - nutcracker port on scb1003 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Giuseppe Lavagetto Working on it! [15:36:36] ACKNOWLEDGEMENT - nutcracker port on scb1004 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Giuseppe Lavagetto Working on it! [15:36:36] ACKNOWLEDGEMENT - nutcracker port on scb2001 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Giuseppe Lavagetto Working on it! [15:36:36] ACKNOWLEDGEMENT - nutcracker port on scb2002 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Giuseppe Lavagetto Working on it! [15:36:36] ACKNOWLEDGEMENT - nutcracker port on scb2003 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Giuseppe Lavagetto Working on it! [15:36:37] ACKNOWLEDGEMENT - nutcracker port on scb2004 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Giuseppe Lavagetto Working on it! [15:36:38] ACKNOWLEDGEMENT - nutcracker port on scb2005 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Giuseppe Lavagetto Working on it! [15:36:38] ACKNOWLEDGEMENT - nutcracker port on scb2006 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Giuseppe Lavagetto Working on it! [15:37:48] oh noes all on fire! 
[15:52:25] 06Operations, 10Analytics, 15User-Elukey: Investigate recent Kafka Burrow alarms for EventLogging - https://phabricator.wikimedia.org/T160886#3113658 (10Nuria) ping @elukey please review & close if pertains [15:53:14] (03PS2) 10Muehlenhoff: Remove obsolete mediawiki::migrate class [puppet] - 10https://gerrit.wikimedia.org/r/356026 [15:56:02] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete mediawiki::migrate class [puppet] - 10https://gerrit.wikimedia.org/r/356026 (owner: 10Muehlenhoff) [16:01:02] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 13Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3298614 (10Nuria) [16:02:36] (03PS1) 10Giuseppe Lavagetto: profile::nutcracker: correctly handle monitoring, refactoring [puppet] - 10https://gerrit.wikimedia.org/r/356060 [16:04:26] (03CR) 10jerkins-bot: [V: 04-1] profile::nutcracker: correctly handle monitoring, refactoring [puppet] - 10https://gerrit.wikimedia.org/r/356060 (owner: 10Giuseppe Lavagetto) [16:07:52] (03PS2) 10Giuseppe Lavagetto: profile::nutcracker: correctly handle monitoring, refactoring [puppet] - 10https://gerrit.wikimedia.org/r/356060 [16:09:07] (03CR) 10jerkins-bot: [V: 04-1] profile::nutcracker: correctly handle monitoring, refactoring [puppet] - 10https://gerrit.wikimedia.org/r/356060 (owner: 10Giuseppe Lavagetto) [16:10:50] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2079237 [16:11:44] (03PS1) 10Volans: Puppet: disable stringified facts in prod [puppet] - 10https://gerrit.wikimedia.org/r/356062 (https://phabricator.wikimedia.org/T166372) [16:13:07] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/6565/" [puppet] - 10https://gerrit.wikimedia.org/r/356060 (owner: 10Giuseppe Lavagetto) [16:13:26] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/6563/" [puppet] - 10https://gerrit.wikimedia.org/r/356038 
(https://phabricator.wikimedia.org/T166389) (owner: 10Filippo Giunchedi) [16:13:54] (03PS3) 10Giuseppe Lavagetto: profile::nutcracker: correctly handle monitoring, refactoring [puppet] - 10https://gerrit.wikimedia.org/r/356060 [16:16:03] (03PS2) 10Volans: Puppet: disable stringified facts in prod [puppet] - 10https://gerrit.wikimedia.org/r/356062 (https://phabricator.wikimedia.org/T166372) [16:17:31] 06Operations, 10DBA: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3298671 (10jcrespo) a:03jcrespo [16:17:33] (03PS1) 10Gehel: logstash - start using elasticsearch-curator for indices cleanup [puppet] - 10https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) [16:17:44] 06Operations, 10DBA: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#1761444 (10jcrespo) p:05Low>03Normal [16:18:24] 06Operations, 10DBA: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#1761444 (10jcrespo) we got some progress: {P5501} [16:19:38] (03PS1) 10Gehel: logstash - cleanup dead code [puppet] - 10https://gerrit.wikimedia.org/r/356064 (https://phabricator.wikimedia.org/T166154) [16:19:51] (03PS3) 10Volans: Puppet: disable stringified facts in prod [puppet] - 10https://gerrit.wikimedia.org/r/356062 (https://phabricator.wikimedia.org/T166372) [16:20:20] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:20:22] (03PS1) 10Faidon Liambotis: aptrepo: fix elastic.co's update filter [puppet] - 10https://gerrit.wikimedia.org/r/356065 [16:20:25] godog, gehel ^ [16:20:33] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::nutcracker: correctly handle monitoring, refactoring [puppet] - 10https://gerrit.wikimedia.org/r/356060 (owner: 10Giuseppe Lavagetto) [16:20:40] (03PS4) 10Giuseppe Lavagetto: profile::nutcracker: correctly handle monitoring, refactoring [puppet] - 10https://gerrit.wikimedia.org/r/356060 [16:20:45] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::nutcracker: correctly handle monitoring, refactoring [puppet] - 10https://gerrit.wikimedia.org/r/356060 (owner: 10Giuseppe Lavagetto) [16:20:55] paravoid: thanks! [16:22:16] (03CR) 10Filippo Giunchedi: [C: 031] "Nice, LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/356065 (owner: 10Faidon Liambotis) [16:22:23] paravoid: thanks! [16:22:27] (03CR) 10Gehel: [C: 031] "LGTM (but my understanding of grep-dctrl is minimal)" [puppet] - 10https://gerrit.wikimedia.org/r/356065 (owner: 10Faidon Liambotis) [16:24:00] (03PS2) 10Faidon Liambotis: aptrepo: fix elastic.co's update filter [puppet] - 10https://gerrit.wikimedia.org/r/356065 [16:24:07] (03CR) 10Faidon Liambotis: [V: 032 C: 032] aptrepo: fix elastic.co's update filter [puppet] - 10https://gerrit.wikimedia.org/r/356065 (owner: 10Faidon Liambotis) [16:28:13] 06Operations, 15User-Joe: Sync internal nutcracker package with Debian package - https://phabricator.wikimedia.org/T166038#3298684 (10Joe) [16:28:33] (03PS1) 10Gehel: elasticsearch - configure logging for elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/356068 (https://phabricator.wikimedia.org/T166154) [16:30:15] (03PS2) 10Faidon Liambotis: pmacct: fix firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/356036 [16:30:48] (03CR) 10Faidon Liambotis: [C: 032] pmacct: fix firewall rules [puppet] - 
10https://gerrit.wikimedia.org/r/356036 (owner: 10Faidon Liambotis) [16:31:36] (03CR) 10Ottomata: [C: 031] pmacct: update config, push output to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/356040 (owner: 10Faidon Liambotis) [16:32:09] jenkins backlogged again? [16:32:26] (03CR) 10Faidon Liambotis: [V: 032 C: 032] pmacct: fix firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/356036 (owner: 10Faidon Liambotis) [16:32:38] (03PS2) 10Faidon Liambotis: pmacct: update config, push output to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/356040 [16:33:13] (03Abandoned) 10Faidon Liambotis: network: add router loopbacks subnets [puppet] - 10https://gerrit.wikimedia.org/r/356035 (owner: 10Faidon Liambotis) [16:33:29] (03CR) 10Faidon Liambotis: [V: 032 C: 032] pmacct: update config, push output to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/356040 (owner: 10Faidon Liambotis) [16:35:50] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:38:07] (03PS2) 10Faidon Liambotis: raid/hpssacli: WARN on permanently disabled cache [puppet] - 10https://gerrit.wikimedia.org/r/354079 (https://phabricator.wikimedia.org/T163998) [16:38:10] (03PS2) 10Faidon Liambotis: raid/hpssacli: check for cable errors/no batteries [puppet] - 10https://gerrit.wikimedia.org/r/354080 (https://phabricator.wikimedia.org/T163998) [16:38:39] (03CR) 10Faidon Liambotis: [C: 032] raid/hpssacli: WARN on permanently disabled cache [puppet] - 10https://gerrit.wikimedia.org/r/354079 (https://phabricator.wikimedia.org/T163998) (owner: 10Faidon Liambotis) [16:38:42] (03CR) 10Faidon Liambotis: [C: 032] raid/hpssacli: check for cable errors/no batteries [puppet] - 10https://gerrit.wikimedia.org/r/354080 (https://phabricator.wikimedia.org/T163998) (owner: 10Faidon Liambotis) [16:39:53] * volans keeping an eye for warnings in the next ~30 min ;) [16:40:16] sigh jenkins [16:40:38] how bloody difficult is it to 
puppet-lint a change in a reasonable amount of time, like e.g. a few seconds [16:40:41] rather than a few minutes [16:40:43] seriously [16:44:00] (03PS2) 10Faidon Liambotis: authdns: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/354073 [16:49:20] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:53:39] PROBLEM - Host elastic2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:24] (03PS9) 10Ottomata: [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [16:59:41] 06Operations, 10DBA: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3298736 (10jcrespo) Some more progress, after removing galera stuff: P5501#29701 [17:02:59] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:05:14] (03PS5) 10Gehel: elasticsearch - silence some loggers for elastic 5.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/353105 [17:06:01] elastic2003 is taking more time than expected to reboot... but should be back in a few seconds [17:06:10] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3298758 (10elukey) The last missing piece that I didn't get before is the following (f... 
[17:06:19] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [17:07:07] (03CR) 10Gehel: [C: 032] elasticsearch - silence some loggers for elastic 5.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/353105 (owner: 10Gehel) [17:07:19] RECOVERY - Check whether ferm is active by checking the default input chain on restbase-dev1003 is OK: OK ferm input default policy is set [17:13:45] (03PS1) 10Volans: raid/hpssacli: allow NRPE to execute all commands [puppet] - 10https://gerrit.wikimedia.org/r/356070 (https://phabricator.wikimedia.org/T163998) [17:13:48] paravoid: ^^ [17:14:07] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3298767 (10elukey) One of the options could be to study a way to deploy nutcracker on... 
[17:14:39] RECOVERY - Host elastic2003 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [17:18:05] paravoid: I'm merging that to fix the issue, FYI [17:18:09] oops [17:18:12] yes please [17:18:20] (03CR) 10Faidon Liambotis: [C: 032] raid/hpssacli: allow NRPE to execute all commands [puppet] - 10https://gerrit.wikimedia.org/r/356070 (https://phabricator.wikimedia.org/T163998) (owner: 10Volans) [17:18:23] (03CR) 10Volans: [C: 032] raid/hpssacli: allow NRPE to execute all commands [puppet] - 10https://gerrit.wikimedia.org/r/356070 (https://phabricator.wikimedia.org/T163998) (owner: 10Volans) [17:18:27] lol [17:20:19] done [17:20:32] thx [17:20:50] (03CR) 10Faidon Liambotis: [C: 031] Puppet: disable stringified facts in prod [puppet] - 10https://gerrit.wikimedia.org/r/356062 (https://phabricator.wikimedia.org/T166372) (owner: 10Volans) [17:22:54] (03CR) 10Volans: "A quick puppet compiler result: https://puppet-compiler.wmflabs.org/6570/" [puppet] - 10https://gerrit.wikimedia.org/r/356062 (https://phabricator.wikimedia.org/T166372) (owner: 10Volans) [17:23:13] to test it in labs I'll cherry pick on my puppetmaster and verify is a noop [17:23:20] paravoid: ^^ [17:26:51] confirmed [17:27:12] ACKNOWLEDGEMENT - HP RAID on ms-be1020 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166517 [17:27:15] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T166517#3298781 (10ops-monitoring-bot) [17:27:19] (03CR) 10Volans: "Confirmed is a noop in labs instances cherry-picking it on my standalone puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/356062 (https://phabricator.wikimedia.org/T166372) (owner: 10Volans) [17:27:21] PROBLEM - HP RAID on db1094 is CRITICAL: CRITICAL: Slot 1: 
OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 [17:27:21] ACKNOWLEDGEMENT - HP RAID on db1094 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166518 [17:27:25] 06Operations, 10ops-eqiad: Degraded RAID on db1094 - https://phabricator.wikimedia.org/T166518#3298788 (10ops-monitoring-bot) [17:29:47] !log disabled puppet on tegmen and disabled raid_handler temporarily T163998 [17:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:58] T163998: check_hpssacli should report on battery failures and cache disabled - https://phabricator.wikimedia.org/T163998 [17:30:10] paravoid: they were not supposed to be WARNING those new checks? [17:30:51] AFAIK *all* the HP-based DBs have caching disabled because they have SSDs and the HPE SSD Smart Path enabled [17:31:57] that's supposed to be handled [17:32:08] yes, I added it to megacli [17:32:15] but only because it was easier [17:32:18] HP is pending [17:34:49] even others are lacking proper BBU monitoring still [17:35:12] not really? [17:35:16] paravoid: see T166518 and T166517 [17:35:18] T166518: Degraded RAID on db1094 - https://phabricator.wikimedia.org/T166518 [17:35:18] T166517: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T166517 [17:36:00] not sure what "not really?" means [17:37:10] volans: ms-be1020 is https://phabricator.wikimedia.org/T163777 [17:37:21] so not a false positive [17:37:35] db1094 is critical because "battery count" is listed as 0 [17:37:39] so there is really no BBU on this host [17:37:48] for that we need to pass --no-battery and it will silence it [17:38:06] or make it a little smarter and make it not warn about that if all of the disks are SSDs maybe? 
[17:38:29] no [17:38:31] jynus: "not really" means that the alerts for HPs should do the right thing [17:38:39] db1094 is really a problem [17:38:47] oh ok, even better then [17:38:53] all those hosts should have a battery [17:38:59] not detecting it is a real problem [17:39:02] alright then :) [17:39:04] (not in particular [17:39:04] jynus: they were merged ~1h ago (the improved checks) [17:39:13] because they are not in use [17:39:14] so the new checks are already finding real problems, good :) [17:39:16] but that is coincidental [17:39:48] speaking of db alerts, there is a check_failover test critical for dbproxy1010 [17:39:51] or maybe puppet hasn't run there yet? [17:39:59] yes, ask marostegui [17:40:12] !log re-enabled puppet on tegmen and re-enabled raid_handler T163998 [17:40:18] I think he is doing an import on one of the hosts [17:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:21] T163998: check_hpssacli should report on battery failures and cache disabled - https://phabricator.wikimedia.org/T163998 [17:40:33] but if we ack the alert, we will not catch the other host going down [17:40:41] volans: thanks for cleaning up my mess :) [17:40:49] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 26488 [17:41:04] yes that was "Stop MySQL labsdb1009 to take a backup - T153743" [17:41:04] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [17:41:11] (03PS1) 10Mobrovac: ChangeProp: Add Redis/Nutcracker connection info [puppet] - 10https://gerrit.wikimedia.org/r/356072 (https://phabricator.wikimedia.org/T161710) [17:41:12] paravoid: yw :) [17:41:37] so now we need to find a proper way to show what's wrong on the auto-generated tasks [17:41:48] is not clear for those HP checks [17:41:55] the problem is that what I did is opt-in [17:42:12] and I would assume no one except us (DBAs) has enabled it [17:42:20] because
the data shown is different [17:42:28] 06Operations, 10MediaWiki-General-or-Unknown, 06Security-Team, 10Traffic, and 2 others: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3298800 (10matmarex) 05Open>03Resolved a:03fgiunchedi [17:43:36] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T166517#3298805 (10Volans) [17:43:38] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3298808 (10Volans) [17:44:10] jynus: the HP checks should do the right thing without a hiera parameter [17:44:23] and I think we could do the same for megacli as well, I'm not sure if making this configurable is such a good idea [17:45:47] 06Operations, 10ops-eqiad: Degraded RAID on db1094 - https://phabricator.wikimedia.org/T166518#3298811 (10Volans) [17:45:59] paravoid how do you manage that? [17:46:03] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1094 - https://phabricator.wikimedia.org/T166518#3298788 (10Volans) [17:46:26] the raid controller knows whether it should be WB or WT and whether it's currently degraded [17:46:41] I wasn't able to discover that in some cases [17:46:56] do you have an example I could check? [17:47:04] what I mean is, there are some clear cases- if the BBU is wrong [17:47:16] or if the temperature is too high [17:47:34] but last db1048 started a learning cycle despite it being disabled [17:47:43] what is "it"? [17:48:11] the learning cycles [17:48:15] being disabled [17:48:26] Help [17:48:32] then there is one thing- manual policy changes [17:49:01] my aim was to detect undesired policy changes, not monitor BBU [17:49:24] and I agree BBU must be monitored, but I do not see an easy way in all cases [17:50:06] do you have an example of a host that is in WT even though it's configured in WB?
[17:50:10] for example, when I briefly enabled it globally [17:50:18] we saw analytics hosts being in wt [17:50:23] jynus, paravoid: I've opened T166519 to improve the raid handler side [17:50:23] T166519: Raid handler: manage new alarms - https://phabricator.wikimedia.org/T166519 [17:50:28] even if most others were in wb [17:50:37] paravoid: db1048 at times [17:50:55] and one analytics host, cannot remember which one, elukey had a look at it [17:51:02] let me search the ticket numbers [17:51:27] db1048 is T160731 [17:51:28] T160731: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731 [17:52:02] many old servers apparently do a learning cycle even if disabled- or someone starts it without logging it [17:52:43] but not always show battery degraded [17:53:26] analytics hosts is T166140 [17:53:26] T166140: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140 [17:53:33] paravoid^ [17:54:51] analytics[1033,1039].eqiad.wmnet, apparently [17:55:53] so there are 2 bad states here- WT if the BBU is right and we want speed [17:56:00] and WB if the BBU is bad [17:56:47] there were several hosts that were in wt, like tin or helium, and I didn't dare but to send an email commenting the status [18:01:57] (03CR) 10Mobrovac: "PCC bueno - https://puppet-compiler.wmflabs.org/6572/" [puppet] - 10https://gerrit.wikimedia.org/r/356072 (https://phabricator.wikimedia.org/T161710) (owner: 10Mobrovac) [18:18:08] (03PS2) 10Jcrespo: query-killer: Do not kill queries containing gtid_wait or DMLs [software] - 10https://gerrit.wikimedia.org/r/351796 [18:18:10] (03PS1) 10Jcrespo: dbtools: Update package for stretch and include systemd support [software] - 10https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) [18:25:47] (03CR) 10Jcrespo: "New systemd unit ( https://gerrit.wikimedia.org/r/#/c/356074/1/dbtools/mariadb.service ) - I have for now started to take the upstream one" [software] -
10https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [18:33:58] (03CR) 10Jcrespo: "Some open questions." (032 comments) [software] - 10https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [18:54:36] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3298909 (10mobrovac) >>! In T166341#3293262, @RobH wrote: > Is this something that you want done in next years budget, or is it now invalid? Please advise. He... [18:57:29] ACKNOWLEDGEMENT - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] Gehel restarting updater to catch up on updates [18:57:38] !log restarting wdqs-updater on wdqs1002 [18:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:19] PROBLEM - Host elastic2004 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:47] damn, elastic2004 is also taking more time than expected to reboot... having a look [19:03:49] RECOVERY - Host elastic2004 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [19:06:29] PROBLEM - Check systemd state on elastic2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:16:29] RECOVERY - Check systemd state on elastic2004 is OK: OK - running: The system is fully operational [19:20:43] (03PS1) 10Gehel: elasticsearch - ignore some warnings related to 5.3.2 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/356079 (https://phabricator.wikimedia.org/T163708) [19:29:55] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3298990 (10Ottomata) Apparently the stuff has to be actually received at the datacenter for it to count towards this year's budget. 
[19:39:30] (03CR) 10DCausse: [C: 031] elasticsearch - ignore some warnings related to 5.3.2 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/356079 (https://phabricator.wikimedia.org/T163708) (owner: 10Gehel) [19:47:50] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3298996 (10mobrovac) Well, that's unfortunate. We definitely want them under warranty as it's an important production system. We still want them, so I guess we'... [19:53:40] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524#3299002 (10Gehel) Looking at `dmesg`, there are a lot of warnings about CPU temperature and throttling: ``` [9098037.343804] CPU23: Package temperature above thresh... [19:55:52] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1002.eqiad.wmnet [19:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:21] !log removing wdqs1002 from LVS pending investigation of T166524 [19:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:30] T166524: high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524 [19:58:16] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524#3299007 (10Gehel) @Cmjohnson is this high temperature an indication that you should do some magic with thermal paste? [20:01:59] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [20:04:11] !log mobrovac@tin Started restart [zotero/translation-server@50f216a]: Memory at 50% [20:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:40] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:36:17] ema: hola! short question (i swear) [20:41:39] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:54:05] jouncebot: next [20:54:05] In 16 hour(s) and 5 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170530T1300) [21:28:24] (03PS1) 10Faidon Liambotis: labstore: remove TC=$(which tc) [puppet] - 10https://gerrit.wikimedia.org/r/356107 [21:28:26] (03PS1) 10Faidon Liambotis: labstore: use the interface_primary fact, not eth0 [puppet] - 10https://gerrit.wikimedia.org/r/356108 [21:28:28] (03PS1) 10Faidon Liambotis: labstore: avoid the hardcoding of eth0/eth1 [puppet] - 10https://gerrit.wikimedia.org/r/356109 [21:29:52] (03PS2) 10Faidon Liambotis: labstore: avoid the hardcoding of eth0/eth1 [puppet] - 10https://gerrit.wikimedia.org/r/356109 [21:42:59] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:49] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy [21:51:45] 06Operations, 10TimedMediaHandler, 10media-storage: Persistent failure of TMH to transcode videos at specific resolutions - https://phabricator.wikimedia.org/T166482#3299091 (10Revent) @fgiunchedi Much smaller files have the same problem. See https://commons.wikimedia.org/wiki/File:Janusz_Cedro_live_in_conce... [21:52:54] (03PS1) 10Faidon Liambotis: Blacklist the parallel port kernel modules [puppet] - 10https://gerrit.wikimedia.org/r/356118 [22:06:19] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:25:37] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524#3299139 (10Smalyshev) If it happens on a single server, not a load issue. Combined with warnings looks like hardware problem. I'll make a pass through the logs tomor... [22:30:49] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:30:49] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:30:49] PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:30:59] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [22:32:39] RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [22:32:39] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:32:40] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [22:34:19] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [23:13:49] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:18:49] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 11 probes of 281 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map