[00:35:52] PROBLEM - MariaDB Slave Lag: x1 on db2033 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.40 seconds
[03:01:42] PROBLEM - mcrouter process on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:01:43] PROBLEM - Check size of conntrack table on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:01:53] PROBLEM - nutcracker process on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:02:02] PROBLEM - MD RAID on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:02:12] PROBLEM - nutcracker port on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:02:12] PROBLEM - dhclient process on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:02:32] PROBLEM - Check whether ferm is active by checking the default input chain on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:02:33] PROBLEM - DPKG on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:02:36] Hmm
[03:02:42] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:02:42] PROBLEM - Disk space on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:02:43] PROBLEM - configured eth on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:04:16] mutante: herron ^^
[03:06:42] PROBLEM - puppet last run on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:08:53] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.6993 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[03:09:13] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.4703 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[03:09:22] PROBLEM - Check the NTP synchronisation status of timesyncd on mwmaint1002 is CRITICAL: Return code of 255 is out of bounds
[03:10:02] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[03:10:23] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[03:16:23] RECOVERY - MD RAID on mwmaint1002 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[03:16:33] RECOVERY - nutcracker port on mwmaint1002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:16:42] RECOVERY - dhclient process on mwmaint1002 is OK: PROCS OK: 0 processes with command name dhclient
[03:16:52] RECOVERY - puppet last run on mwmaint1002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[03:17:02] RECOVERY - Check whether ferm is active by checking the default input chain on mwmaint1002 is OK: OK ferm input default policy is set
[03:17:02] RECOVERY - DPKG on mwmaint1002 is OK: All packages OK
[03:17:12] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational
[03:17:13] RECOVERY - Disk space on mwmaint1002 is OK: DISK OK
[03:17:13] RECOVERY - configured eth on mwmaint1002 is OK: OK - interfaces up
[03:17:13] RECOVERY - mcrouter process on mwmaint1002 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter
[03:17:22] RECOVERY - Check size of conntrack table on mwmaint1002 is OK: OK: nf_conntrack is 0 % full
[03:17:32] RECOVERY - nutcracker process on mwmaint1002 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[03:34:42] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858.27 seconds
[03:38:42] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops
[03:39:23] RECOVERY - Check the NTP synchronisation status of timesyncd on mwmaint1002 is OK: OK: synced at Sat 2018-10-20 03:39:21 UTC.
[03:42:12] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.604 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[03:42:33] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.419 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[03:42:43] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.3383 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[03:43:22] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[03:43:42] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[03:43:52] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[03:47:02] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 239.95 seconds
[05:38:20] !log Force writeback on db2033 - T184888
[05:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:25] T184888: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888
[06:44:37] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (Smalyshev) I did a short evaluation on provided VM and it looks like it behaves...
[06:45:43] RECOVERY - MariaDB Slave Lag: x1 on db2033 is OK: OK slave_sql_lag Replication lag: 47.06 seconds
[07:31:11] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: wdqs1009 - cannot create /var/log/wdqs/wdqs_autodeployment.log - https://phabricator.wikimedia.org/T206318 (Smalyshev)
[07:31:17] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: wdqs1009 - cannot create /var/log/wdqs/wdqs_autodeployment.log - https://phabricator.wikimedia.org/T206318 (Smalyshev) p:Triage>High
[07:34:02] (PS1) Rxy: Add CentralAuth related permissions to stewards at metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468691 (https://phabricator.wikimedia.org/T207531)
[07:34:43] (PS2) Rxy: Add CentralAuth related permissions to stewards at metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468691 (https://phabricator.wikimedia.org/T207531)
[08:08:43] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2391 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[08:11:02] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.6198 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[08:15:23] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2901 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[08:16:32] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[08:17:43] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[08:57:39] (PS1) GTirloni: Initial import of shinken-2.0.3 [debs/shinken] - https://gerrit.wikimedia.org/r/468692
[08:58:31] (PS1) Matěj Suchánek: Update several Wikidata-related configs [mediawiki-config] - https://gerrit.wikimedia.org/r/468693
[09:07:22] (PS2) GTirloni: Initial import of shinken-2.0.3 [debs/shinken] - https://gerrit.wikimedia.org/r/468692 (https://phabricator.wikimedia.org/T204562)
[09:11:56] (PS3) GTirloni: Initial import of shinken-2.0.3 [debs/shinken] - https://gerrit.wikimedia.org/r/468692 (https://phabricator.wikimedia.org/T204562)
[09:12:28] (PS4) GTirloni: Initial import of shinken-2.0.3 [debs/shinken] - https://gerrit.wikimedia.org/r/468692 (https://phabricator.wikimedia.org/T204562)
[09:21:08] (CR) GTirloni: [V: 2 C: 2] Initial import of shinken-2.0.3 [debs/shinken] - https://gerrit.wikimedia.org/r/468692 (https://phabricator.wikimedia.org/T204562) (owner: GTirloni)
[09:40:59] Operations, Puppet, Cloud-VPS: Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (faidon) Ping? Could we setup a couple of puppetmasters in the new "cloudinfra" project and see where that leads us? I was previously told that this is probably a 1-2 weeks p...
[09:41:41] Operations, MediaWiki-Page-deletion, Performance: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (TTO)
[09:42:07] Operations, Puppet, Cloud-Services: Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (faidon)
[09:53:35] Operations, Cloud-VPS: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (faidon) p:Triage>Normal
[09:56:07] Operations, Cloud-VPS: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (faidon)
[09:56:08] (CR) Framawiki: [C: -1] Enable suppressredirect and markbotedit rights to rollbackers on it.wikiversity (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300) (owner: Zoranzoki21)
[10:57:15] Operations, Cloud-VPS: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (Krenair) > The only gotcha seems to be that the recursor runs some custom Lua code, that uses data generated by a Python script, that in turn seems to gather those from Nova's API. I'm not sure if that's acces...
[11:06:33] Operations, Puppet, Cloud-Services: Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (Krenair) I imagine we'd need to issue every instance being moved a new puppet cert, as we presumably wouldn't want to hand the current labs puppetmaster CA over to the...
[11:09:02] Operations, Cloud-VPS: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (Krenair)
[11:09:38] Operations, Cloud-Services, Mail, Patch-For-Review, User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (Krenair)
[11:09:52] Operations, Puppet, Cloud-Services: Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (Krenair)
[11:10:34] Krenair: oh thanks!
[11:10:44] I wanted to do that too, so +1 and really appreciated
[11:10:56] paravoid, yeah I noticed a pattern of tasks emerging and thought I'd try to track them
[11:11:00] are there any others floating around?
[11:12:02] I searched for Cloud VPS/Cloud-Services tasks open and authored by you, the other ones that came up don't appear relevant
[11:14:51] I don't think so
[11:58:07] Operations, Cloud-VPS: Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (faidon)
[12:32:00] (PS1) Faidon Liambotis: designate/mitaka: remove typo'ed extension [puppet] - https://gerrit.wikimedia.org/r/468697
[12:38:36] (CR) Alex Monk: "Looks like this goes back to the original designate puppetisation in Ic06414d1a942ad0ef9f1fd4be5f5bd002cd07cda so has probably always been" [puppet] - https://gerrit.wikimedia.org/r/468697 (owner: Faidon Liambotis)
[12:43:19] (CR) Hoo man: [C: 2] Add CentralAuth related permissions to stewards at metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468691 (https://phabricator.wikimedia.org/T207531) (owner: Rxy)
[12:44:32] (Merged) jenkins-bot: Add CentralAuth related permissions to stewards at metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468691 (https://phabricator.wikimedia.org/T207531) (owner: Rxy)
[12:46:41] !log hoo@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add CentralAuth related permissions to stewards at metawiki (T207531) (duration: 01m 09s)
[12:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:45] T207531: Migrate global permissions "globalgroupmembership" and "globalgrouppermissions" to meta local definition - https://phabricator.wikimedia.org/T207531
[12:57:36] (CR) jenkins-bot: Add CentralAuth related permissions to stewards at metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468691 (https://phabricator.wikimedia.org/T207531) (owner: Rxy)
[13:03:00] Operations, Cloud-VPS: Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (Krenair)
[13:03:53] Operations, Cloud-VPS: Move labmon (Graphite, StatsD) into a Cloud VPS - https://phabricator.wikimedia.org/T207543 (Krenair)
[13:04:24] Operations, Cloud-VPS: Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (Krenair)
[13:13:45] Operations, Cloud-VPS: Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (Krenair)
[13:14:10] Operations, Cloud-VPS: Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (Krenair)
[13:24:25] Operations, Cloud-VPS: Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (Krenair)
[13:30:33] PROBLEM - Disk space on eventlog1002 is CRITICAL: DISK CRITICAL - free space: / 1766 MB (3% inode=98%)
[13:37:37] Operations, Horizon, Traffic, Upstream: Horizon Designate dashboard not allowing creation of NS records - https://phabricator.wikimedia.org/T204013 (Krenair) I created an upstream patch, it got merged, now we just need to wait for OpenStack Stein to be released and upgrade to it. Also my original...
[13:53:45] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.26/includes/auth/AuthManager.php: (no justification provided) (duration: 00m 55s)
[13:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:57] (PS1) Alex Monk: labs recursor: require interface alias before trying to start pdns-recursor [puppet] - https://gerrit.wikimedia.org/r/468708
[14:14:54] (CR) Alex Monk: "e.g." [puppet] - https://gerrit.wikimedia.org/r/468708 (owner: Alex Monk)
[14:34:31] (PS1) Alex Monk: labs recursor: Tell labsaliaser to use keystone public port instead of admin port [puppet] - https://gerrit.wikimedia.org/r/468709
[14:35:22] (CR) jerkins-bot: [V: -1] labs recursor: Tell labsaliaser to use keystone public port instead of admin port [puppet] - https://gerrit.wikimedia.org/r/468709 (owner: Alex Monk)
[14:36:39] (PS2) Alex Monk: labsaliaser: use keystone public port instead of admin port [puppet] - https://gerrit.wikimedia.org/r/468709 (https://phabricator.wikimedia.org/T207533)
[14:39:13] RECOVERY - Disk space on eventlog1002 is OK: DISK OK
[14:40:07] working on it --^
[14:41:44] (PS1) Alex Monk: labs dnsrecursor: require clientlib before labsaliaser [puppet] - https://gerrit.wikimedia.org/r/468714
[14:44:04] (CR) Alex Monk: "I think this works in prod at the moment because the hosts include this through other profiles." [puppet] - https://gerrit.wikimedia.org/r/468714 (owner: Alex Monk)
[14:54:51] (CR) Alex Monk: "(See the Ferm::Rule resources at the bottom of modules/profile/manifests/openstack/base/keystone/service.pp)" [puppet] - https://gerrit.wikimedia.org/r/468709 (https://phabricator.wikimedia.org/T207533) (owner: Alex Monk)
[15:01:13] (CR) Alex Monk: [C: 1] designate/mitaka: remove typo'ed extension [puppet] - https://gerrit.wikimedia.org/r/468697 (owner: Faidon Liambotis)
[15:27:24] (PS1) Elukey: eventlogging::server: rotate logs on size (not only on time) [puppet] - https://gerrit.wikimedia.org/r/468718
[15:28:10] (CR) jerkins-bot: [V: -1] eventlogging::server: rotate logs on size (not only on time) [puppet] - https://gerrit.wikimedia.org/r/468718 (owner: Elukey)
[15:28:36] (PS2) Elukey: eventlogging::server: rotate logs on size (not only on time) [puppet] - https://gerrit.wikimedia.org/r/468718
[15:28:47] oh noes I didn't see jenkins
[15:28:51] another -1 coming
[15:29:14] (CR) jerkins-bot: [V: -1] eventlogging::server: rotate logs on size (not only on time) [puppet] - https://gerrit.wikimedia.org/r/468718 (owner: Elukey)
[15:29:42] (PS3) Elukey: eventlogging::server: rotate logs on size (not only on time) [puppet] - https://gerrit.wikimedia.org/r/468718
[15:30:32] (PS2) Alex Monk: labs dnsrecursor: require clientlib before labsaliaser [puppet] - https://gerrit.wikimedia.org/r/468714 (https://phabricator.wikimedia.org/T207533)
[15:30:57] Operations, Cloud-VPS, Patch-For-Review: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (Krenair) Created `labs-dnsrecursor-alex-test.openstack.eqiad.wmflabs` and applied `profile::openstack::base::pdns::recursor::service` as well as this hieradata to make it as similar to a...
[15:31:57] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13119/" [puppet] - https://gerrit.wikimedia.org/r/468718 (owner: Elukey)
[15:37:57] cc: mobrovac ---^ I applied a max size for the /var/log/eventlogging dir because of an issue with eventlog1002, but it applies also to kafka[1,2]*. I don't see any issue with it but lemme know otherwise
[16:09:27] Operations, Puppet, Cloud-Services: Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (Krenair) >>! In T171188#4682291, @faidon wrote: > - Security model: I suppose that's cloudinfra, right? We need to address that regardless, as we move more services wit...
[17:07:04] Reedy, can you look at the stack trace in https://phabricator.wikimedia.org/T207553 ?
[17:07:53] Error: 1048 Column 'afa_parameters' cannot be null (10.64.32.64)
[17:08:24] ty
[18:53:22] (PS1) Niharika29: Deploy TemplateWizard everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/468730
[19:51:02] PROBLEM - High lag on wdqs1003 is CRITICAL: 3601 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[19:53:12] PROBLEM - High lag on wdqs1003 is CRITICAL: 3611 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[19:53:48] * gehel is looking at wdqs1003
[19:54:50] !log depooling wdqs1003 to catch up on lag
[19:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:32] PROBLEM - High lag on wdqs1003 is CRITICAL: 3632 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:39:06] Operations, MediaWiki-Page-deletion, Performance: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (Izno) This might possibly be caused by the work for {T198176}.
[21:29:31] !log repooling wdqs1003 (still some lag, but 100[45] start to be impacted)
[21:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:27] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.26/extensions/CentralAuth/: Update setEmail (duration: 00m 55s)
[23:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:53] (PS1) GTirloni: shinken: Adjustments necessary to upgrade 1.4->2.0 and Trusty->Jessie [puppet] - https://gerrit.wikimedia.org/r/468792 (https://phabricator.wikimedia.org/T204562)
[23:32:32] Operations, Cloud-VPS: Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (Krenair)
[23:32:50] gtirloni, nice, did you get it working?
[23:35:18] Krenair: yep, seems like it! I'll just push a small modification to that change though, forgot about cherrypy
[23:35:36] gtirloni, is that based on what I did or did you start afresh?
[23:36:14] note that it seemed at first like I got it working, didn't survive the first few restarts though :(
[23:36:49] Krenair: yeah, the restarts caused a few problems with invalid directory permissions and whatnot.. it's weird. you're right, I need to double check that
[23:37:57] at some point I just removed all packages and configs, ran Puppet and started over.. I was too deep in the rabbit hole
[23:38:24] adding a comment to the phab task
[23:38:53] was this all based on my attempt?
[23:39:01] ah
[23:39:04] ok
[23:42:02] I started from your attempt yeah, trying things out.. but I had a few fresh starts that I took as a learning opportunity and it took me a while to get to the point where you left off (I cursed myself more than enough times for that) :-)
[23:43:48] Krenair: my wife is looking at me with the angry eyes.. gotta shutdown the computer now. Thanks for all your work, much appreciated! (it was great that you narrowed it down to a single error). If you could give your feedback about the change, that'd be awesome.. if you want to add anything, feel free to do so too :)
[23:44:12] heh
[23:44:15] no worries
[23:44:20] thanks