[00:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161122T0000).
[00:00:05] tgr: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:06] godog: as far as I can tell the silent failures are intentional - the script tries not to accidentally leak something via a stack trace (also, hundreds of stack-trace emails bouncing back might be bad)
[00:00:27] robh: might aswell use yours as it's tagged witht he task
[00:00:28] it probably needs proper monitoring but I'm not really sure how to set that up
[00:00:37] (Abandoned) Reedy: Comment out db1092 after crash till dba have looked at box [mediawiki-config] - https://gerrit.wikimedia.org/r/322800 (owner: Reedy)
[00:00:56] (CR) Reedy: [C: 2] db1092 crashed and was offline for a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/322801 (https://phabricator.wikimedia.org/T151272) (owner: RobH)
[00:01:28] (Merged) jenkins-bot: db1092 crashed and was offline for a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/322801 (https://phabricator.wikimedia.org/T151272) (owner: RobH)
[00:01:44] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
[00:01:58] twentyafterfour: silent failure of the whole feature email to task, but yeah essentially monitoring too
[00:01:59] ^ related? :P
[00:03:02] !log reedy@tin Synchronized wmf-config/db-eqiad.php: Depool db1092 after crash T151272 (duration: 00m 59s)
[00:03:13] probably related, a bunch of "Could not wait for replica DBs to catch up to db1049"
[00:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:24] T151272: db1092 crash - https://phabricator.wikimedia.org/T151272
[00:03:47] Operations, DBA: db1092 crash - https://phabricator.wikimedia.org/T151272#2812819 (Reedy)
[00:05:30] Reedy: just to confirm since i saw you merging you handled merge?
[00:05:38] i dont wanna leave it in a limbo state assuming =]
[00:05:41] robh: yeah, it's deployed
[00:05:45] cool, thanks!
[00:06:44] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[00:07:04] Co-incidence
[00:07:07] OR IS IT
[00:09:30] Operations, ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2812828 (RobH)
[00:16:18] Operations, ops-codfw: cp4008 power supply failure - https://phabricator.wikimedia.org/T151275#2812855 (RobH)
[00:19:30] Operations, ops-codfw: cp4008 power supply failure - https://phabricator.wikimedia.org/T151275#2812874 (RobH)
[00:19:32] Operations, ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2812875 (RobH)
[00:19:34] Operations, ops-ulsfo: ulsfo pdu 1.22 replacement - https://phabricator.wikimedia.org/T151263#2812873 (RobH)
[00:20:12] (CR) Filippo Giunchedi: [C: 2] Phabricator: Unbreak incoming email and harden config file permissions. [puppet] - https://gerrit.wikimedia.org/r/322791 (https://phabricator.wikimedia.org/T151229) (owner: 20after4)
[00:20:55] twentyafterfour: can't submit https://gerrit.wikimedia.org/r/#/c/322791/ since it has https://gerrit.wikimedia.org/r/#/c/322781 as its parent :(
[00:21:19] (PS3) Reedy: Phabricator: Unbreak incoming email and harden config file permissions. [puppet] - https://gerrit.wikimedia.org/r/322791 (https://phabricator.wikimedia.org/T151229) (owner: 20after4)
[00:21:25] godog: ^ fixed
[00:21:30] press cherry pick, type prod
[00:21:46] ah, thanks Reedy #TIL
[00:21:49] :D
[00:22:01] it's non obvious that you can cherry pick back onto the same branch...
[00:23:04] Operations, ops-ulsfo: ulsfo pdu 1.22 replacement - https://phabricator.wikimedia.org/T151263#2812879 (RobH) Since the PDU replacement, it turns out that cp4008 is the likely cause. > We replaced the PDU in your cab but it seems the culprit was either one of two power supplies that are out on your uni...
[00:25:51] indeed
[00:35:33] !log bblack@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4008.ulsfo.wmnet
[00:35:47] !log depooled cp4008 (cache_text ulsfo) - T151275
[00:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:13] T151275: cp4008 power supply failure - https://phabricator.wikimedia.org/T151275
[00:40:37] (PS1) Filippo Giunchedi: phabricator: fix exit vs sys.exit in phab_epipe.py [puppet] - https://gerrit.wikimedia.org/r/322804
[00:43:38] (CR) Filippo Giunchedi: [C: 2] phabricator: fix exit vs sys.exit in phab_epipe.py [puppet] - https://gerrit.wikimedia.org/r/322804 (owner: Filippo Giunchedi)
[00:43:43] (PS2) Filippo Giunchedi: phabricator: fix exit vs sys.exit in phab_epipe.py [puppet] - https://gerrit.wikimedia.org/r/322804
[00:45:04] !log cr[12]-ulsfo - added metric 15 to lvs4002 in policy LVS_import - T151273
[00:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:26] T151273: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273
[00:47:29] !log cr[12]-ulsfo - added metric 15 to lvs4002 in policy LVS_import (for real this time) - T151273
[00:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:38] Operations, ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2812828 (BBlack) Above was this on both ulsfo routers: ``` set policy-options policy-statement LVS_import term lvs4002_T151273 from protocol bgp neighbor 10.128.0.12 set policy-options policy-statement LVS_imp...
[00:51:19] Operations, ops-ulsfo, netops: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2812895 (BBlack)
[00:52:01] !log reboot ms-be2025 T151201
[00:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:24] T151201: ms-be2025 controller failure - https://phabricator.wikimedia.org/T151201
[00:52:32] (PS1) RobH: Revert "depooling ulsfo" [dns] - https://gerrit.wikimedia.org/r/322806
[00:52:46] (CR) RobH: [C: 2] Revert "depooling ulsfo" [dns] - https://gerrit.wikimedia.org/r/322806 (owner: RobH)
[00:54:04] RECOVERY - Host ms-be2025 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms
[00:54:27] Operations, ops-ulsfo: ulsfo pdu 1.22 replacement - https://phabricator.wikimedia.org/T151263#2812899 (RobH) @bblack went ahead and depooled cp4008 from service and set lvs4002 as non-preferred over lvs4004 (so unless 4004 fails, it'll be primary and leave 2002 as secondary.) the lvs preference had to b...
[00:54:56] twentyafterfour: wasnt here earlier, saw it now.. then that it got merged meanwhile. all good?
[00:56:21] yeah we're back
[00:56:29] mutante: it's all good, thank you
[00:56:32] and thank you godog
[00:57:20] Operations, Patch-For-Review, Prometheus-metrics-monitoring: Deploy federation for Prometheus - https://phabricator.wikimedia.org/T150486#2812900 (fgiunchedi)
[00:57:24] thanks godog and 20
[01:00:36] Operations, ops-codfw, DC-Ops: ms-be2025 controller failure - https://phabricator.wikimedia.org/T151201#2812902 (fgiunchedi) a:Papaul Indeed looks like battery/cache failure, I've rebooted ms-be2025 and it came up fine modulo the disclaimer above for POST error. hpssacli: ``` Smart Array P840 in...
[01:02:25] PROBLEM - HP RAID on ms-be2025 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[01:03:19] (PS1) Eevans: enable instance restbase2011-a.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/322807 (https://phabricator.wikimedia.org/T151086)
[01:05:05] RECOVERY - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is OK: TCP OK - 0.036 second response time on 10.192.16.188 port 9042
[01:05:44] (CR) Dzahn: [C: 1] "SETENV and NOSETENV" [puppet] - https://gerrit.wikimedia.org/r/322781 (https://phabricator.wikimedia.org/T151148) (owner: 20after4)
[01:05:45] Operations, ops-codfw: cp4008 power supply failure - https://phabricator.wikimedia.org/T151275#2812855 (Cmjohnson) This system is out of warranty.
[01:06:34] Operations, ops-ulsfo, netops: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2812828 (Cmjohnson) This system is out of warranty.
[01:06:35] (CR) Eevans: [C: -1] "I think there may be something else required before setting up the first instance." [puppet] - https://gerrit.wikimedia.org/r/322807 (https://phabricator.wikimedia.org/T151086) (owner: Eevans)
[01:07:12] urandom: good timing, was about do ask/just do it
[01:07:25] PROBLEM - NTP on ms-be2025 is CRITICAL: NTP CRITICAL: Offset unknown
[01:07:30] mutante: is that gerrit enough?
[01:07:40] (PS2) 20after4: Allow aklapper to `sudo -E` phabricator admin utilities [puppet] - https://gerrit.wikimedia.org/r/322781 (https://phabricator.wikimedia.org/T151148)
[01:07:55] i don't have ssh access to that machine, and was trying to remember if there was something else needed the first time
[01:08:23] i feel like i've asked, and had answered, this question once before...
[01:08:36] urandom: last time we needed to apply the puppet role on it to get you access. let's see
[01:08:57] it was like activating also gave you access , wasnt it
[01:09:01] checks
[01:09:35] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:11:25] restbase20[01][0-9] already has restbase::server
[01:12:02] but.. 2011 says puppet didnt run in a while
[01:12:35] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [1800.0]
[01:14:09] !log restbast2011 - enabling puppet, running puppet, seeing error about missing secret, disabling puppet again
[01:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:33] :P urandom: it needs a secret restbase2011.kst file
[01:14:48] !?
[01:15:07] so when puppet runs there we get an error
[01:15:10] oh, so ....
[01:15:11] yeah
[01:15:16] i was about to say...
[01:15:17] that tells us that a restbase2011.kst is missing
[01:15:23] that it tries to find in the private repo
[01:15:28] right
[01:15:31] and last time filippo added them
[01:15:38] yeah
[01:15:42] so i dont know yet how you create those
[01:15:57] and since puppet doesnt run
[01:16:00] it doesnt create your user
[01:16:04] https://wikitech.wikimedia.org/wiki/Cassandra#Installing_and_generating_certificates
[01:17:04] odd that would be needed though, only the instance certs should be needed
[01:17:18] i.e. -a -b -c
[01:17:24] and I have crappy internet ATM, grrr
[01:18:23] mutante: what is the context there? what is puppet working on when it errors?
[01:18:41] godog: are we generating a host.kst file for some reason?
[01:20:06] not that I remember no
[01:20:09] urandom: it can't retrieve the catalog from the master, because a 400 error happens because the file is missing, so it skips the entire run
[01:20:24] k, wierd.
[01:20:41] ¯\_(ツ)_/¯
[01:20:51] it says that it tries to find it in cassandra/services/restbase2011/
[01:22:25] that happens because https://gerrit.wikimedia.org/r/#/c/322807/ isn't merged yet
[01:22:35] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0]
[01:22:36] i.e. there's only the default cassandra instance
[01:22:46] auh
[01:22:55] file { "${config_directory}/tls/server.key":
[01:22:56] content => secret("cassandra/${tls_cluster_name}/${tls_hostname}/${tls_hostname}.kst"),
[01:23:05] in instance.pp
[01:23:05] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:23:09] that actually makes sense
[01:23:55] I think https://gerrit.wikimedia.org/r/#/c/322807 can go as is tho and it will do the right thing
[01:23:58] mutante: yeah, without an instance it tries to start the 'default' instance, and that isn't configured (because we don't want that)
[01:24:12] aha
[01:24:18] a bit unfortunate that it results in puppet not running heh
[01:24:24] yeah
[01:24:42] (CR) Dzahn: [C: 2] enable instance restbase2011-a.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/322807 (https://phabricator.wikimedia.org/T151086) (owner: Eevans)
[01:24:46] well then
[01:24:49] not ideal behavior there, but it makes sense, in a way
[01:25:54] (PS3) Filippo Giunchedi: phabricator: fix exit vs sys.exit in phab_epipe.py [puppet] - https://gerrit.wikimedia.org/r/322804
[01:26:31] !log restbase2011 enabling puppet, initial run after activation with gerrit 322807
[01:26:40] it's installing jdk8 and stuff
[01:26:48] ferm rules now
[01:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:42] starting cassandra-metrics-collector service ..
[01:27:47] twentyafterfour: btw I took a quick look at phab_epipe and it looks like it ignores the exit status of phab's php ingester, which is probably why it swallowed the message vs bouncing it
[01:28:46] urandom: try SSH now, it created your user
[01:28:54] mutante: yup
[01:28:57] there is one failed dependency with scap
[01:29:01] but besides that it finished all
[01:29:15] runs it one more time
[01:30:03] it changed some permissions starts cassandra instances now
[01:30:05] RECOVERY - puppet last run on restbase2011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[01:30:11] the scap deploy service ..and all is fine
[01:30:28] UJ restbase2011-a.codfw.wmnet 377.18 KB 256 ...
[01:30:56] so it's merge, 2 puppet runs and then fine. not bad at all
[01:31:15] :)
[01:31:55] mutante: and: sudo service cassandra-a stop && sudo rm -rf /srv/cassandra-a/* && sudo service cassandra-a start
[01:32:17] because it the abortive attempt at running creates corrupt state
[01:32:33] because it tries to startup before it has the scap deployed compaction strategy
[01:32:50] ...and this is only for the first instance, the others only require a merge!
[01:32:57] so...glass half full...
[01:33:03] interesting, we have a new incinga check for "systemd state"
[01:33:49] urandom: ahhh. ok. i see *nod*
[01:37:25] RECOVERY - NTP on ms-be2025 is OK: NTP OK: Offset -0.007744818926 secs
[01:38:07] (PS7) Dzahn: base/ipmi: install freeipmi globally, move to ipmi module [puppet] - https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160)
[01:38:22] (CR) Dzahn: [C: -1] base/ipmi: install freeipmi globally, move to ipmi module [puppet] - https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: Dzahn)
[01:38:35] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[01:40:16] !log T151086: RESTBase: Starting 'a' instance Cassandra cleanups, rack 'b', codfw
[01:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:38] T151086: RESTBase cluster expansion - https://phabricator.wikimedia.org/T151086
[01:49:42] (PS8) Dzahn: base/ipmi: install freeipmi globally, move to ipmi module [puppet] - https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160)
[01:50:08] PROBLEM - MariaDB Slave Lag: s5 on db1092 is CRITICAL: CRITICAL slave_sql_lag could not connect
[01:50:32] sigh
[01:50:55] silencing again, it is depooled anyways
[01:51:05] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[01:51:49] ACKNOWLEDGEMENT - MariaDB Slave IO: s5 on db1092 is CRITICAL: CRITICAL slave_io_state could not connect Filippo Giunchedi crash, depooled - T151272
[01:51:52] ACKNOWLEDGEMENT - MariaDB Slave Lag: s5 on db1092 is CRITICAL: CRITICAL slave_sql_lag could not connect Filippo Giunchedi crash, depooled - T151272
[01:51:55] ACKNOWLEDGEMENT - MariaDB Slave SQL: s5 on db1092 is CRITICAL: CRITICAL slave_sql_state could not connect Filippo Giunchedi crash, depooled - T151272
[01:51:58] ACKNOWLEDGEMENT - mysqld processes on db1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Filippo Giunchedi crash, depooled - T151272
[01:52:15] pages. blerg
[01:52:32] guess I'll try to sleep
[01:52:36] aye, sorry about that apergos
[01:52:45] happens
[01:55:36] ACKNOWLEDGEMENT - HP RAID on ms-be2025 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. Filippo Giunchedi controller T151201
[01:57:03] (PS9) Dzahn: base/ipmi: install freeipmi globally, move to ipmi module [puppet] - https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160)
[01:59:22] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.151, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[02:02:02] PROBLEM - Restbase root url on restbase2011 is CRITICAL: connect to address 10.192.32.151 and port 7231: Connection refused
[02:02:07] ACKNOWLEDGEMENT - Restbase root url on restbase2011 is CRITICAL: connect to address 10.192.32.151 and port 7231: Connection refused daniel_zahn setup in progress
[02:02:07] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.32.152:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.152 and port 9042: Connection refused daniel_zahn setup in progress
[02:02:07] ACKNOWLEDGEMENT - restbase endpoints health on restbase2011 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.151, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) daniel_zahn setup in progress
[02:27:59] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.3) (duration: 10m 20s)
[02:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:18] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Nov 22 02:32:17 UTC 2016 (duration 4m 18s)
[02:32:22] RECOVERY - HP RAID on ms-be2025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller
[02:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:54:13] Operations, MediaWiki-General-or-Unknown, Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2813006 (Marshallsumter) I agree to closing!
[02:55:45] (CR) Dzahn: "alright, after more changes and rebases let's see what happens here file by file:" [puppet] - https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: Dzahn)
[02:56:47] Operations, Patch-For-Review: Remote IPMI doens't work for ~17% of the fleet - https://phabricator.wikimedia.org/T150160#2775695 (Dzahn) amended https://gerrit.wikimedia.org/r/#/c/320246/9
[03:18:36] (CR) Yuvipanda: [C: 2] webservice: guard against PYTHONPATH munging in caller's environment [software/tools-webservice] - https://gerrit.wikimedia.org/r/322799 (https://phabricator.wikimedia.org/T147350) (owner: BryanDavis)
[03:19:07] (Merged) jenkins-bot: webservice: guard against PYTHONPATH munging in caller's environment [software/tools-webservice] - https://gerrit.wikimedia.org/r/322799 (https://phabricator.wikimedia.org/T147350) (owner: BryanDavis)
[03:28:42] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 758.20 seconds
[03:38:42] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 271.22 seconds
[03:49:02] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0]
[03:51:22] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0]
[03:53:12] PROBLEM - puppet last run on db1081 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:04:52] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0]
[04:21:02] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 50.00% above the threshold [1.0]
[04:21:02] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0]
[04:22:12] RECOVERY - puppet last run on db1081 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[05:15:47] (PS1) Alex Monk: Template out reference to deployment.eqiad.wmnet in inactive.motd [puppet] - https://gerrit.wikimedia.org/r/322825 (https://phabricator.wikimedia.org/T146505)
[05:16:22] Puppet, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, Easy, and 2 others: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2813287 (Krenair) a:Krenair
[06:13:12] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:13:22] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:25:32] RECOVERY - Check systemd state on mw2085 is OK: OK - running: The system is fully operational
[06:25:32] RECOVERY - Check systemd state on mw2082 is OK: OK - running: The system is fully operational
[06:25:52] RECOVERY - Check systemd state on mw2080 is OK: OK - running: The system is fully operational
[06:25:52] RECOVERY - Check systemd state on mw2084 is OK: OK - running: The system is fully operational
[06:26:02] RECOVERY - Check systemd state on mw2081 is OK: OK - running: The system is fully operational
[06:26:13] RECOVERY - Check systemd state on mw2083 is OK: OK - running: The system is fully operational
[06:42:12] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:42:22] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:45:52] PROBLEM - Check systemd state on mw2080 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:46:12] PROBLEM - Check systemd state on mw2083 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:46:32] PROBLEM - Check systemd state on mw2082 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:48:52] PROBLEM - Check systemd state on mw2084 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:49:32] PROBLEM - Check systemd state on mw2085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:50:06] (PS1) Dzahn: installserver: move http to own class (kill carbon WIP) [puppet] - https://gerrit.wikimedia.org/r/322829
[06:50:12] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:53:18] (PS2) Dzahn: installserver: move http to own class (kill carbon WIP) [puppet] - https://gerrit.wikimedia.org/r/322829
[06:54:06] !log Stopping Replication on db1095 for maintenance - T150960
[06:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:31] T150960: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960
[06:56:02] PROBLEM - Check systemd state on mw2081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:57:42] !log Stopping Replication on db2057 for maintenance - T150960
[06:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:47] Operations, DBA: db1092 crash - https://phabricator.wikimedia.org/T151272#2812799 (Marostegui) Thanks for Robh for taking care of this. I am going to have a look to see if we can find why it crashed.
[07:02:02] (PS1) Dzahn: installserver: split squid proxy to own class (kill carbon WIP) [puppet] - https://gerrit.wikimedia.org/r/322830
[07:02:24] Operations, DBA: db1092 crash - https://phabricator.wikimedia.org/T151272#2813334 (Marostegui) a:Marostegui
[07:03:27] (CR) jenkins-bot: [V: -1] installserver: split squid proxy to own class (kill carbon WIP) [puppet] - https://gerrit.wikimedia.org/r/322830 (owner: Dzahn)
[07:19:12] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[07:20:56] Operations, Gerrit, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2813337 (hashar)
[07:23:07] !log Reboot db1092 for RAID controller upgrade - T151272
[07:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:29] T151272: db1092 crash - https://phabricator.wikimedia.org/T151272
[07:28:56] RECOVERY - mysqld processes on db1092 is OK: PROCS OK: 1 process with command name mysqld
[07:30:16] RECOVERY - MariaDB Slave SQL: s5 on db1092 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:30:36] RECOVERY - MariaDB Slave IO: s5 on db1092 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:30:57] Operations, DBA: db1092 crash - https://phabricator.wikimedia.org/T151272#2813348 (Marostegui) Error from yesterday ``` /system1/log1/record12 Targets Properties number=12 severity=Caution date=11/21/2016 time=23:52 description=Option ROM POST Error: 1719-Slot 1 Drive Array - A c...
[07:30:58] <_joe_> 3 sms?
[07:31:14] _joe_: Yeah, I was also wondering why they arrive if the server is silenced?
[07:55:16] RECOVERY - MariaDB Slave Lag: s5 on db1092 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[08:00:39] Operations, ops-codfw, DBA: db2035: RAID disk about to fail - https://phabricator.wikimedia.org/T150511#2813371 (Marostegui) Open>Resolved All good now - thank you! ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physic...
[08:06:18] (PS1) Marostegui: db-eqiad.php: Depool db1081 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/322833 (https://phabricator.wikimedia.org/T147305)
[08:07:27] (PS2) Marostegui: db-eqiad.php: Depool db1081 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/322833 (https://phabricator.wikimedia.org/T147305)
[08:08:14] (CR) Hashar: [C: 1] docker: apt repo before installing package [puppet] - https://gerrit.wikimedia.org/r/321485 (owner: Dduvall)
[08:08:16] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1081 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/322833 (https://phabricator.wikimedia.org/T147305) (owner: Marostegui)
[08:08:53] (Merged) jenkins-bot: db-eqiad.php: Depool db1081 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/322833 (https://phabricator.wikimedia.org/T147305) (owner: Marostegui)
[08:09:20] (PS6) Hashar: contint: New role for Docker based CI slave [puppet] - https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) (owner: Dduvall)
[08:10:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1081 - T147305 (duration: 00m 50s)
[08:10:43] (CR) Hashar: [C: 1] "I have removed the [WIP] prefix. jenkins-deploy is the wmflabs user used by the Jenkins master to connect to an instance and spawn the J" [puppet] - https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) (owner: Dduvall)
[08:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:02] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305
[08:14:09] !log Deploy ALTER table db1081 commonswiki.revision - T147305
[08:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:36] elukey ^
[08:14:38] :)
[08:20:22] Operations, ops-eqiad, DC-Ops: Decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151295#2813405 (Joe)
[08:20:32] Operations, ops-eqiad, DC-Ops: Decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151295#2813418 (Joe) p:Triage>Unbreak!
[08:20:46] Operations, ops-eqiad, DC-Ops, User-Joe: Decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151295#2813420 (Joe)
[08:22:24] Operations, Monitoring, Patch-For-Review: Fix up icinga puppetization - https://phabricator.wikimedia.org/T110893#2813422 (Joe)
[08:22:26] Operations, Monitoring, Patch-For-Review, User-Joe: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2813421 (Joe) Open>Resolved
[08:30:35] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1081 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/322835
[08:31:43] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1081 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/322835 (owner: Marostegui)
[08:32:13] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1081 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/322835 (owner: Marostegui)
[08:33:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1081 - T147305 (duration: 00m 54s)
[08:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:08] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305
[08:43:43] (PS1) Giuseppe Lavagetto: Add mwtest1001,2 aliases for test mediawiki hosts [dns] - https://gerrit.wikimedia.org/r/322836 (https://phabricator.wikimedia.org/T151295)
[08:44:14] (CR) Giuseppe Lavagetto: [C: 2] Add mwtest1001,2 aliases for test mediawiki hosts [dns] - https://gerrit.wikimedia.org/r/322836 (https://phabricator.wikimedia.org/T151295) (owner: Giuseppe Lavagetto)
[08:45:26] (CR) Hashar: [V: 1] "Puppet compile for fluorine.eqiad.wmnet https://puppet-compiler.wmflabs.org/4625/ show that this change is a noop." [puppet] - https://gerrit.wikimedia.org/r/322639 (https://phabricator.wikimedia.org/T151169) (owner: Hashar)
[08:50:32] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:54:37] (PS1) Marostegui: db-eqiad.php: Depool db1084 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/322837 (https://phabricator.wikimedia.org/T147305)
[08:56:25] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1084 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/322837 (https://phabricator.wikimedia.org/T147305) (owner: Marostegui)
[08:56:58] (Merged) jenkins-bot: db-eqiad.php: Depool db1084 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/322837 (https://phabricator.wikimedia.org/T147305) (owner: Marostegui)
[08:58:03] Operations, ops-eqiad, DC-Ops, Patch-For-Review, User-Joe: Decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151295#2813449 (Joe)
[08:58:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1084 - T147305 (duration: 00m 53s)
[08:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:10] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305
[09:00:57] !log Deploy ALTER table db1084 commonswiki.revision - T147305
[09:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:47] (PS1) Giuseppe Lavagetto: debug_proxy: add mwtest1001,1002, aliases of mw1170,mw1171 [puppet] - https://gerrit.wikimedia.org/r/322838 (https://phabricator.wikimedia.org/T151295)
[09:07:36] (CR) jenkins-bot: [V: -1] debug_proxy: add mwtest1001,1002, aliases of mw1170,mw1171 [puppet] - https://gerrit.wikimedia.org/r/322838 (https://phabricator.wikimedia.org/T151295) (owner: Giuseppe Lavagetto)
[09:11:56] (PS2) Giuseppe Lavagetto: debug_proxy: add mwtest1001,1002, aliases of mw1170,mw1171 [puppet] - https://gerrit.wikimedia.org/r/322838 (https://phabricator.wikimedia.org/T151295)
[09:12:47] (CR) jenkins-bot: [V: -1] debug_proxy: add mwtest1001,1002, aliases of mw1170,mw1171 [puppet] - https://gerrit.wikimedia.org/r/322838 (https://phabricator.wikimedia.org/T151295) (owner: Giuseppe Lavagetto)
[09:14:19] (PS2) Jcrespo: Depool db1059 to apply schema change [mediawiki-config] - https://gerrit.wikimedia.org/r/322699 (https://phabricator.wikimedia.org/T151029)
[09:16:34] (CR) Jcrespo: [C: 2] Depool db1059 to apply schema change [mediawiki-config] - https://gerrit.wikimedia.org/r/322699 (https://phabricator.wikimedia.org/T151029) (owner: Jcrespo)
[09:17:14] (Merged) jenkins-bot: Depool db1059 to apply schema change [mediawiki-config] - https://gerrit.wikimedia.org/r/322699 (https://phabricator.wikimedia.org/T151029) (owner: Jcrespo)
[09:18:32] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[09:19:25] isn't puppet (I assume it is the proxy) failing more often than usual?
[09:19:35] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1059 T151029 (duration: 00m 49s) [09:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:59] T151029: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029 [09:23:30] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1084 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322843 [09:25:01] (03PS3) 10Giuseppe Lavagetto: debug_proxy: add mwtest1001,1002, aliases of mw1170,mw1171 [puppet] - 10https://gerrit.wikimedia.org/r/322838 (https://phabricator.wikimedia.org/T151295) [09:35:20] (03CR) 10Giuseppe Lavagetto: [C: 032] debug_proxy: add mwtest1001,1002, aliases of mw1170,mw1171 [puppet] - 10https://gerrit.wikimedia.org/r/322838 (https://phabricator.wikimedia.org/T151295) (owner: 10Giuseppe Lavagetto) [09:36:30] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2813526 (10Gilles) The client doesn't necessarily have to advertise what it wants, but then if it doesn't tell us anything, we would have to be conserv... [09:36:48] !log performing blocking schema change on db1059 (depooled) T151029 [09:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:07] T151029: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029 [09:40:12] PROBLEM - Check systemd state on hassium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:40:49] <_joe_> fucking nginx [09:44:08] !log performing blocking schema change on db1084 (depooled) T151029 [09:44:12] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:29] T151029: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029 [09:45:12] RECOVERY - Check systemd state on hassium is OK: OK - running: The system is fully operational [09:46:27] !log Replaced slow Jenkins job operations-puppet-puppetlint-strict in favor of using 'rake test' which runs puppet-lint solely against files changed in HEAD https://gerrit.wikimedia.org/r/322839 [09:46:43] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/322639 (https://phabricator.wikimedia.org/T151169) (owner: 10Hashar) [09:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:32] (03PS1) 10Jcrespo: Revert "Depool db1059 to apply schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322849 [09:55:12] PROBLEM - Check systemd state on hassium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:58:12] RECOVERY - Check systemd state on hassium is OK: OK - running: The system is fully operational [09:58:52] (03PS1) 10Giuseppe Lavagetto: debug_proxy: add map_hash_bucket_size [puppet] - 10https://gerrit.wikimedia.org/r/322851 [09:59:39] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] debug_proxy: add map_hash_bucket_size [puppet] - 10https://gerrit.wikimedia.org/r/322851 (owner: 10Giuseppe Lavagetto) [10:00:59] !log starting elasticsearch cluster restart for JDK and nginx upgrade [10:01:06] !log starting elasticsearch codfw cluster restart for JDK and nginx upgrade [10:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:25] (03PS2) 10Jcrespo: Revert "db-eqiad.php: Depool db1084 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322843 (owner: 10Marostegui) [10:05:55] (03CR) 10Jcrespo: [C: 032] Revert "db-eqiad.php: Depool db1084 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322843 (owner: 10Marostegui) [10:07:29] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2627494 (10Volans) Additional cronspam from the same script with different message: ``` Set $wgShowExceptionDetails = true; in LocalSettings.php to show detailed debugging information. ``` The cron is: ``` ### from email subject: C... [10:13:12] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [10:15:11] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2813589 (10Volans) Also a different one was triggered: ``` ### From email subject: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null ### From terbium crontab: # Puppet Name: cleanup_up... [10:16:54] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2627494 (10Peachey88) >>! 
In T145360#2813572, @Volans wrote: > Additional cronspam from the same script with different message: not related, other breakage {T148957} [10:18:31] (03PS2) 10Jcrespo: Revert "Depool db1059 to apply schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322849 [10:22:14] (03PS1) 10Marostegui: sanitarium2.my.cnf.erb: Enable triggers in RBR [puppet] - 10https://gerrit.wikimedia.org/r/322855 (https://phabricator.wikimedia.org/T150960) [10:22:43] 06Operations: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#2813597 (10hashar) 05Open>03declined That is done in the repository puppet-lint configuration file: ``` name=/.puppet-lint.rc --no-autoloader_layout-check ``` [10:23:41] (03CR) 10Gilles: "I'm not familiar with that puppet compiler tool, does it mean that the changeset is deployed on those two production machine?" [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [10:24:24] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2813599 (10hashar) [10:25:34] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2133125 (10hashar) Updated task description based on the current `/.puppet-lint.rc` [10:25:50] (03PS4) 10Addshore: Enable RevisionSlider (non BetaFeature) on de,ar,hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319541 (https://phabricator.wikimedia.org/T148646) [10:27:18] (03CR) 10Jcrespo: [C: 032] Revert "Depool db1059 to apply schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322849 (owner: 10Jcrespo) [10:27:21] (03CR) 10WMDE-Fisch: [C: 031] ":-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319541 (https://phabricator.wikimedia.org/T148646) (owner: 10Addshore) 
[10:28:04] (03Merged) 10jenkins-bot: Revert "Depool db1059 to apply schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322849 (owner: 10Jcrespo) [10:30:31] (03CR) 10Jcrespo: sanitarium2.my.cnf.erb: Enable triggers in RBR (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322855 (https://phabricator.wikimedia.org/T150960) (owner: 10Marostegui) [10:34:16] (03PS2) 10Marostegui: sanitarium2.my.cnf.erb: Enable triggers in RBR [puppet] - 10https://gerrit.wikimedia.org/r/322855 (https://phabricator.wikimedia.org/T150960) [10:34:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1059 & db1084 T151029 (duration: 00m 51s) [10:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:41] T151029: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029 [10:40:53] 06Operations, 10ops-eqiad, 06DC-Ops, 15User-Joe: Hardware decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151303#2813646 (10Joe) [10:41:05] 06Operations, 10ops-eqiad, 06DC-Ops, 15User-Joe: Hardware decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151303#2813661 (10Joe) a:05Joe>03None [10:42:49] (03PS1) 10Jcrespo: Depool db1091 to apply blocking schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322858 (https://phabricator.wikimedia.org/T151029) [10:43:16] (03PS1) 10Giuseppe Lavagetto: site.pp: convert mw1017,mw1099 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/322860 (https://phabricator.wikimedia.org/T151295) [10:43:42] PROBLEM - mediawiki-installation DSH group on mw1017 is CRITICAL: Host mw1017 is not in mediawiki-installation dsh group [10:43:52] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] site.pp: convert mw1017,mw1099 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/322860 (https://phabricator.wikimedia.org/T151295) (owner: 10Giuseppe Lavagetto) [10:43:52] PROBLEM - mediawiki-installation DSH group on mw1099 is CRITICAL: Host mw1099 is 
not in mediawiki-installation dsh group [10:43:57] <_joe_> known [10:44:32] 06Operations: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304#2813667 (10Volans) [10:51:12] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:51:23] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on several images Mediawiki renders succesfully - https://phabricator.wikimedia.org/T150746#2813718 (10Gilles) I've confirmed that the 504 is coming from nginx's 60s limit. And I see that mediawiki on the first example takes 87 seconds to render the t... [10:52:02] (03Abandoned) 10Arseny1992: Enable RevisionSlider (non betafeature) on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321103 (https://phabricator.wikimedia.org/T150573) (owner: 10Arseny1992) [11:00:21] (03PS1) 10Alexandros Kosiaris: Introduce mwdebug1001, mwdebug1002 [dns] - 10https://gerrit.wikimedia.org/r/322863 [11:03:12] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [11:03:33] (03CR) 10Jcrespo: [C: 032] Depool db1091 to apply blocking schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322858 (https://phabricator.wikimedia.org/T151029) (owner: 10Jcrespo) [11:04:21] (03Merged) 10jenkins-bot: Depool db1091 to apply blocking schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322858 (https://phabricator.wikimedia.org/T151029) (owner: 10Jcrespo) [11:07:02] 06Operations, 06Labs: create-dbusers service failing on labstore1004 - https://phabricator.wikimedia.org/T151310#2813785 (10Volans) [11:07:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1091 (duration: 00m 57s) [11:07:25] (03PS2) 10Alexandros Kosiaris: Introduce mwdebug1001, mwdebug1002 [dns] - 10https://gerrit.wikimedia.org/r/322863 [11:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:26] 
06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on several images Mediawiki renders succesfully - https://phabricator.wikimedia.org/T150746#2813800 (10Gilles) A lot of them in that list turn out to be very fast to render both on mediawiki and thumbor. This would suggest intermittent issues with thu... [11:15:34] !log performing blocking schema change on db1091 (depooled) T151029 [11:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:56] T151029: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029 [11:16:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "A few details that should be fixed" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [11:16:26] (03CR) 10Giuseppe Lavagetto: [C: 031] PDF Render Service: Add to SCB [puppet] - 10https://gerrit.wikimedia.org/r/305259 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [11:16:34] !log Deploy ALTER table db1091 commonswiki.revision - T147305 [11:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:55] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [11:17:20] there is contention on commonswiki [11:17:36] probably just a higher rate of linksupdate jobs [11:17:46] not worrying yet [11:18:03] yep, seeing the locks errors now [11:22:06] (03PS1) 10Alexandros Kosiaris: Introduce mwdebug1001, mwdebug1002 [puppet] - 10https://gerrit.wikimedia.org/r/322865 (https://phabricator.wikimedia.org/T151303) [11:28:07] (03PS25) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [11:29:00] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce mwdebug1001, mwdebug1002 [dns] - 10https://gerrit.wikimedia.org/r/322863 (owner: 10Alexandros Kosiaris) [11:30:26] (03CR) 10Mobrovac: PDF Render Service: Role and module (036 comments) [puppet] 
- 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [11:34:14] (03PS1) 10Jcrespo: Repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322868 (https://phabricator.wikimedia.org/T151029) [11:34:39] ^marostegui [11:35:04] jynus: yes, go ahead [11:35:18] still lagged [11:35:36] From yours right? Mine was non blocking [11:36:11] yes, mine was blocking on purpose [11:36:23] yes I know [11:37:50] it should be ok now [11:38:49] <_joe_> mobrovac: I made you make a mistake :P [11:38:53] <_joe_> let me fix it [11:39:29] hehehe [11:40:30] 06Operations: logrotate failing with $FILE.1.gz: File exists - https://phabricator.wikimedia.org/T151314#2813912 (10Volans) [11:40:45] (03CR) 10Jcrespo: [C: 031] sanitarium2.my.cnf.erb: Enable triggers in RBR [puppet] - 10https://gerrit.wikimedia.org/r/322855 (https://phabricator.wikimedia.org/T150960) (owner: 10Marostegui) [11:41:16] (03PS26) 10Giuseppe Lavagetto: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [11:41:19] (03PS3) 10Marostegui: sanitarium2.my.cnf.erb: Enable triggers in RBR [puppet] - 10https://gerrit.wikimedia.org/r/322855 (https://phabricator.wikimedia.org/T150960) [11:42:01] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [11:42:09] !log fixed logrotate on cp1008, removed empty created .1.gz files T151314 [11:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:28] T151314: logrotate failing with $FILE.1.gz: File exists - https://phabricator.wikimedia.org/T151314 [11:43:29] (03PS4) 10Marostegui: sanitarium2.my.cnf.erb: Enable triggers in RBR [puppet] - 10https://gerrit.wikimedia.org/r/322855 (https://phabricator.wikimedia.org/T150960) [11:45:58] (03PS3) 10Mobrovac: PDF 
Render Service: Add to SCB [puppet] - 10https://gerrit.wikimedia.org/r/305259 (https://phabricator.wikimedia.org/T143129) [11:46:48] (03CR) 10Marostegui: [C: 032] sanitarium2.my.cnf.erb: Enable triggers in RBR [puppet] - 10https://gerrit.wikimedia.org/r/322855 (https://phabricator.wikimedia.org/T150960) (owner: 10Marostegui) [11:50:10] (03CR) 10Mobrovac: "PCC - https://puppet-compiler.wmflabs.org/4628/" [puppet] - 10https://gerrit.wikimedia.org/r/305259 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [11:50:12] (03PS4) 10Giuseppe Lavagetto: PDF Render Service: Add to SCB [puppet] - 10https://gerrit.wikimedia.org/r/305259 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [11:50:17] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] PDF Render Service: Add to SCB [puppet] - 10https://gerrit.wikimedia.org/r/305259 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [11:50:17] lol [11:51:02] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[electron-render/deploy] [11:51:20] <_joe_> wow [11:51:22] _joe_: related? [11:51:27] probably [11:51:27] <_joe_> it is, yes [11:51:37] <_joe_> mobrovac: uhm what did we forget? [11:51:48] <_joe_> I disabled puppet on scb in the meantime, anyways [11:52:09] tin is not complaining? 
[11:52:21] <_joe_> it will :P [11:52:34] <_joe_> I was about to run puppet there [11:52:42] try and let's see [11:53:01] <_joe_> I suppose it will fail as well [11:53:26] i can't see the log so i have no idea why it is failing [11:54:27] <_joe_> it worked fine on tin [11:54:31] <_joe_> let me see on mira [11:54:50] <_joe_> in fact I was checking the code and everything is there [11:55:40] (03CR) 10Jcrespo: [C: 032] Repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322868 (https://phabricator.wikimedia.org/T151029) (owner: 10Jcrespo) [11:56:11] <_joe_> so it was a temporary failure [11:56:12] (03Merged) 10jenkins-bot: Repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322868 (https://phabricator.wikimedia.org/T151029) (owner: 10Jcrespo) [11:56:43] <_joe_> to contact gerrit [11:57:01] oh ok [11:57:02] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [11:57:08] there we go ^ [11:57:21] _joe_: so let's go with codfw first? 
[11:57:28] <_joe_> already doing it [11:57:31] <_joe_> on one host [11:57:39] scb in codfw that is [11:57:55] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1091 (duration: 00m 49s) [11:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] <_joe_> mobrovac: puppet will take its time to run, it has to install all the damn font packages :) [12:00:48] oh right [12:01:55] 06Operations, 06Operations-Software-Development: wmf-reimage and handling of "-n" option - https://phabricator.wikimedia.org/T144264#2813980 (10Volans) p:05Triage>03Low [12:02:19] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2813982 (10Volans) p:05Triage>03Normal [12:05:51] !log retrying schema change on db1040 (page) T151029 [12:06:04] <_joe_> mobrovac: uhm a scap deploy failure I'd say [12:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:13] T151029: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029 [12:07:32] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Package[electron-render/deploy] [12:08:42] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf] [12:10:50] 06Operations: stat user crontab on stat hosts for old file removal - https://phabricator.wikimedia.org/T151317#2813986 (10Volans) [12:16:32] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [12:20:16] 06Operations, 10ops-codfw, 10ops-ulsfo: cp4008 power supply failure - https://phabricator.wikimedia.org/T151275#2814027 (10Volans) [12:24:47] (03PS2) 10Alexandros Kosiaris: Introduce mwdebug1001, mwdebug1002 [puppet] - 10https://gerrit.wikimedia.org/r/322865 (https://phabricator.wikimedia.org/T151303) [12:24:50] (03PS1) 10Alexandros Kosiaris: debug_proxy: Use mwdebug instead of mwtest hosts [puppet] - 10https://gerrit.wikimedia.org/r/322876 [12:24:52] (03PS1) 10Alexandros Kosiaris: Revert mw1170, mw1171 to their former roles [puppet] - 10https://gerrit.wikimedia.org/r/322877 [12:25:52] PROBLEM - DPKG on scb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:27:26] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2814032 (10mobrovac) [12:27:28] 06Operations, 06Services (done), 15User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2814029 (10mobrovac) 05stalled>03Resolved The integration has been completed, resolving. 
[12:28:47] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce mwdebug1001, mwdebug1002 [puppet] - 10https://gerrit.wikimedia.org/r/322865 (https://phabricator.wikimedia.org/T151303) (owner: 10Alexandros Kosiaris) [12:28:51] (03PS3) 10Alexandros Kosiaris: Introduce mwdebug1001, mwdebug1002 [puppet] - 10https://gerrit.wikimedia.org/r/322865 (https://phabricator.wikimedia.org/T151303) [12:28:54] (03CR) 10Alexandros Kosiaris: [V: 032] Introduce mwdebug1001, mwdebug1002 [puppet] - 10https://gerrit.wikimedia.org/r/322865 (https://phabricator.wikimedia.org/T151303) (owner: 10Alexandros Kosiaris) [12:28:59] RECOVERY - DPKG on scb2001 is OK: All packages OK [12:30:09] PROBLEM - pdfrender on scb2004 is CRITICAL: connect to address 10.192.16.36 and port 5252: Connection refused [12:30:39] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[electron-render/deploy] [12:31:39] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:31:59] PROBLEM - pdfrender on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 5252: Connection refused [12:33:04] (03PS1) 10Mobrovac: PDF Render: Add missing libgconf2 library package [puppet] - 10https://gerrit.wikimedia.org/r/322878 (https://phabricator.wikimedia.org/T143129) [12:34:00] (03CR) 10Giuseppe Lavagetto: [C: 032] PDF Render: Add missing libgconf2 library package [puppet] - 10https://gerrit.wikimedia.org/r/322878 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [12:34:39] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:37:29] PROBLEM - DPKG on scb2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:39:29] RECOVERY - DPKG on scb2002 is OK: All packages OK [12:45:29] PROBLEM - puppet last run on scb2003 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[electron-render/deploy] [12:45:59] RECOVERY - Check systemd state on mw2080 is OK: OK - running: The system is fully operational [12:45:59] RECOVERY - Check systemd state on mw2084 is OK: OK - running: The system is fully operational [12:46:09] RECOVERY - pdfrender on scb2004 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.100 second response time [12:46:10] RECOVERY - Check systemd state on mw2081 is OK: OK - running: The system is fully operational [12:46:19] RECOVERY - Check systemd state on mw2083 is OK: OK - running: The system is fully operational [12:46:29] RECOVERY - Check systemd state on mw2085 is OK: OK - running: The system is fully operational [12:46:39] RECOVERY - Check systemd state on mw2082 is OK: OK - running: The system is fully operational [12:49:31] (03PS1) 10Mobrovac: PDF Render: Add libasound and libgtk2 [puppet] - 10https://gerrit.wikimedia.org/r/322881 (https://phabricator.wikimedia.org/T143129) [12:52:30] (03CR) 10Giuseppe Lavagetto: [C: 032] PDF Render: Add libasound and libgtk2 [puppet] - 10https://gerrit.wikimedia.org/r/322881 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [12:57:50] <_joe_> jouncebot: next [12:57:50] In 1 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161122T1400) [12:58:05] <_joe_> akosiaris: do you think the two machines will be installed in 1 hour? [12:59:16] PROBLEM - HHVM processes on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:59:36] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:59:45] (03PS1) 10Alexandros Kosiaris: jobrunners: Also ensure units disabled if stopped [puppet] - 10https://gerrit.wikimedia.org/r/322883 [12:59:46] PROBLEM - pdfrender on scb2003 is CRITICAL: connect to address 10.192.0.33 and port 5252: Connection refused [13:00:16] PROBLEM - configured eth on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:00:26] PROBLEM - dhclient process on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:00:36] PROBLEM - mediawiki-installation DSH group on mwdebug1001 is CRITICAL: Host mwdebug1001 is not in mediawiki-installation dsh group [13:00:40] (03CR) 10Giuseppe Lavagetto: [C: 031] jobrunners: Also ensure units disabled if stopped [puppet] - 10https://gerrit.wikimedia.org/r/322883 (owner: 10Alexandros Kosiaris) [13:00:56] PROBLEM - nutcracker port on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:01:16] PROBLEM - nutcracker process on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:01:26] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:01:32] _joe_: yes [13:01:38] they are running puppet now [13:01:46] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:01:46] PROBLEM - salt-minion processes on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:01:56] PROBLEM - Check size of conntrack table on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:06] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:26] PROBLEM - Check whether ferm is active by checking the default input chain on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:02:29] <_joe_> akosiaris: ok [13:02:36] PROBLEM - DPKG on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:56] PROBLEM - Disk space on mwdebug1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:03:07] PROBLEM - DPKG on scb1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:05:06] RECOVERY - DPKG on scb1002 is OK: All packages OK [13:05:16] PROBLEM - DPKG on scb1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:05:36] RECOVERY - puppet last run on scb2003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [13:06:44] (03CR) 10BBlack: [C: 031] Remove varnish::apt_preferences [puppet] - 10https://gerrit.wikimedia.org/r/322703 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [13:06:46] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[electron-render/deploy] [13:07:16] RECOVERY - DPKG on scb1003 is OK: All packages OK [13:07:37] (03CR) 10Alexandros Kosiaris: [C: 032] jobrunners: Also ensure units disabled if stopped [puppet] - 10https://gerrit.wikimedia.org/r/322883 (owner: 10Alexandros Kosiaris) [13:07:41] (03PS2) 10Alexandros Kosiaris: jobrunners: Also ensure units disabled if stopped [puppet] - 10https://gerrit.wikimedia.org/r/322883 [13:07:43] (03CR) 10Alexandros Kosiaris: [V: 032] jobrunners: Also ensure units disabled if stopped [puppet] - 10https://gerrit.wikimedia.org/r/322883 (owner: 10Alexandros Kosiaris) [13:07:46] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:08:16] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[electron-render/deploy] [13:09:19] Notice: /Stage[main]/Mediawiki::Jobrunner/Base::Service_unit[jobrunner]/Service[jobrunner]/enable: enable changed 'true' to 'false' [13:09:20] cool [13:09:56] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[electron-render/deploy] [13:10:56] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:11:16] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:11:24] (03PS1) 10BBlack: varnish: un-template v[34] syntax-helper variables [puppet] - 10https://gerrit.wikimedia.org/r/322884 (https://phabricator.wikimedia.org/T150660) [13:11:26] (03PS1) 10BBlack: varnish: remove chash director leftovers [puppet] - 10https://gerrit.wikimedia.org/r/322885 (https://phabricator.wikimedia.org/T150660) [13:12:46] <_joe_> uhm [13:13:17] (03CR) 10Addshore: "Why does the value for testwikidatawiki seem to change form 7 to 694 in this patch?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 (owner: 10Thiemo Mättig (WMDE)) [13:13:36] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [13:13:56] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:13:56] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:13:56] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:14:14] <_joe_> expected ^^ [13:14:36] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:14:46] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:14:46] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[pdfrender] [13:15:04] ? ^ [13:15:17] ah right [13:16:14] <_joe_> mobrovac: honestly, the best option is to make the service not running in puppet [13:16:23] <_joe_> and then do the upgrade next week [13:16:31] <_joe_> I see changes in chrooting in firejail [13:16:52] <_joe_> we would have to check like 8 services [13:16:55] (03CR) 10Thiemo Mättig (WMDE): "It's either you or me confusing something." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 (owner: 10Thiemo Mättig (WMDE)) [13:17:15] (03PS1) 10Hoo man: Kill all remaining dumpers if one failed, to start over faster [puppet] - 10https://gerrit.wikimedia.org/r/322886 [13:18:17] _joe_: let's enable => false the service for now, and i will test firejail with the other services in deployment-prep after lunch [13:18:20] then we can decide [13:19:08] lets not deploy this this week if we need upgrade firejail as well [13:19:10] <_joe_> mobrovac: I'm not sure we should decide to ignore the deployment freeze, this is by any definition a potentially-breaking change, and that's not what the exception was agreed upon [13:19:49] <_joe_> mobrovac: I'm ok with having firejail upgraded in codfw, though, so that you can test the service against production [13:20:03] (03Abandoned) 10Hashar: Basic gbp.conf / build for jessie-wikimedia [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/301581 (owner: 10Hashar) [13:20:16] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[pdfrender] [13:23:04] (03PS4) 10Thiemo Mättig (WMDE): Add missing $wgPropertySuggesterClassifyingPropertyIds for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 [13:24:30] (03PS2) 10ArielGlenn: Kill all remaining dumpers if one failed, to start over faster [puppet] - 10https://gerrit.wikimedia.org/r/322886 (owner: 10Hoo man) [13:24:38] (03PS1) 10Mobrovac: PDF Render: Disable the service for the time being [puppet] - 10https://gerrit.wikimedia.org/r/322887 [13:24:43] _joe_: ^ [13:26:00] <_joe_> ahem [13:26:03] (03PS1) 10Giuseppe Lavagetto: pdfrender: disable by default in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/322888 [13:26:22] <_joe_> mine is more complete :P [13:27:04] (03CR) 10ArielGlenn: [C: 032] Kill all remaining dumpers if one failed, to start over faster [puppet] - 10https://gerrit.wikimedia.org/r/322886 (owner: 10Hoo man) [13:27:59] (03CR) 10jenkins-bot: [V: 04-1] PDF Render: Disable the service for the time being [puppet] - 10https://gerrit.wikimedia.org/r/322887 (owner: 10Mobrovac) [13:28:09] lol [13:28:18] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[pdfrender] [13:29:08] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused [13:29:08] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:08] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:19] (03CR) 10Mobrovac: [C: 031] "Nice." [puppet] - 10https://gerrit.wikimedia.org/r/322888 (owner: 10Giuseppe Lavagetto) [13:29:38] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[pdfrender] [13:29:41] (03Abandoned) 10Mobrovac: PDF Render: Disable the service for the time being [puppet] - 10https://gerrit.wikimedia.org/r/322887 (owner: 10Mobrovac) [13:29:48] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:29:58] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused [13:30:48] PROBLEM - pdfrender on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 5252: Connection refused [13:31:07] (03PS2) 10Giuseppe Lavagetto: pdfrender: disable by default in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/322888 [13:31:22] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] pdfrender: disable by default in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/322888 (owner: 10Giuseppe Lavagetto) [13:31:27] (03PS3) 10Giuseppe Lavagetto: pdfrender: disable by default in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/322888 [13:31:29] (03CR) 10Giuseppe Lavagetto: [V: 032] pdfrender: disable by default in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/322888 (owner: 10Giuseppe Lavagetto) [13:31:48] RECOVERY - Disk space on mwdebug1001 is OK: DISK OK [13:31:48] RECOVERY - Check size of conntrack table on mwdebug1001 is OK: OK: nf_conntrack is 0 % full [13:31:48] RECOVERY - nutcracker port on mwdebug1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [13:31:48] RECOVERY - DPKG on mwdebug1001 is OK: All packages OK [13:31:51] _joe_: hm, actually, shouldn't monitoring be disabled if is_active => False ? 
[13:31:58] RECOVERY - nutcracker process on mwdebug1001 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [13:32:08] RECOVERY - configured eth on mwdebug1001 is OK: OK - interfaces up [13:32:08] PROBLEM - mediawiki-installation DSH group on mwdebug1002 is CRITICAL: Host mwdebug1002 is not in mediawiki-installation dsh group [13:32:08] PROBLEM - pdfrender on scb1004 is CRITICAL: connect to address 10.64.48.29 and port 5252: Connection refused [13:32:18] RECOVERY - HHVM processes on mwdebug1001 is OK: PROCS OK: 6 processes with command name hhvm [13:32:18] RECOVERY - Check whether ferm is active by checking the default input chain on mwdebug1001 is OK: OK ferm input default policy is set [13:32:18] RECOVERY - dhclient process on mwdebug1001 is OK: PROCS OK: 0 processes with command name dhclient [13:32:38] RECOVERY - salt-minion processes on mwdebug1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:33:48] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [13:33:58] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [13:34:08] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 80: Connection refused [13:34:08] RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [13:34:18] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:34:38] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:34:58] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: connect to address 10.64.32.124 and port 80: Connection refused [13:35:38] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 80: Connection refused [13:36:18] PROBLEM - 
Apache HTTP on mwdebug1002 is CRITICAL: connect to address 10.64.32.124 and port 80: Connection refused [13:37:08] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 71789 bytes in 8.390 second response time [13:37:48] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational [13:38:18] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.136 second response time [13:38:25] mark: ack, but is installing firejail in codfw ok with you seeing it's not the active dc? [13:38:40] yeah fine with me [13:38:44] kk [13:38:48] PROBLEM - DPKG on mwdebug1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:39:08] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [13:39:48] RECOVERY - DPKG on mwdebug1001 is OK: All packages OK [13:40:58] RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational [13:41:38] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.826 second response time [13:42:08] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 71660 bytes in 0.424 second response time [13:42:18] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [13:42:47] (03CR) 10Alexandros Kosiaris: [C: 032] debug_proxy: Use mwdebug instead of mwtest hosts [puppet] - 10https://gerrit.wikimedia.org/r/322876 (owner: 10Alexandros Kosiaris) [13:42:51] (03PS2) 10Alexandros Kosiaris: debug_proxy: Use mwdebug instead of mwtest hosts [puppet] - 10https://gerrit.wikimedia.org/r/322876 [13:42:54] (03CR) 10Alexandros Kosiaris: [V: 032] debug_proxy: Use mwdebug instead of mwtest hosts [puppet] - 10https://gerrit.wikimedia.org/r/322876 (owner: 10Alexandros Kosiaris) [13:42:58] jouncebot: next [13:42:58] In 0 hour(s) and 17 minute(s): 
European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161122T1400) [13:43:33] (03PS5) 10Hashar: Enable RevisionSlider (non BetaFeature) on de,ar,hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319541 (https://phabricator.wikimedia.org/T148646) (owner: 10Addshore) [13:43:53] hashar: will you be okay to dpeloy that one? yet again I am in a meeting :P [13:44:02] of course I'll be here to check it! :) [13:44:15] <_joe_> hashar: so, when you have to deploy, ping me :) [13:44:23] addshore: yeah I will do it [13:44:23] <_joe_> mw1017 and mw1099 are gone [13:44:37] oooh [13:44:39] addshore: are you sure the Deutsch Wikipedia agreed to it ? [13:45:03] hashar: yes, per it being their technical wish. [13:45:04] addshore: I dont want the change to start a riot of some sort though the ext has been written by WMDE so I guess you guys have covered it :] [13:45:16] _joe_: gone ?? [13:45:34] _joe_: have you just sprinted a migration to Ganeti? [13:45:37] <_joe_> hashar: and substituted with mwdebug1001 and mwdebug1002 [13:45:41] <_joe_> yeah [13:45:42] .. [13:45:46] Yep, everything for the german community should be covered, and the other 2 wikis have tickets & local discussions :) [13:46:01] <_joe_> so if you select mw1017 it will point to mwdebug1001, mw1099 to mwdebug1002 [13:46:05] addshore: yeah so all set. Will push it :] thank you and happy meeting [13:46:13] addshore: good luck with all the RTL related bugs [13:46:13] <_joe_> hashar: I am about to fix all docs and send an email to the ops list [13:46:45] _joe_: oh you have setup aliases that is clever. We will want to update the chrome and firefox extensions I guess [13:47:04] zeljkof: mw1099 is gone! 
replaced by mwdebug1001 / mwdebug1002 :] [13:47:26] <_joe_> hashar: yes, I'm just sorry we didn't make it to send the email earlier [13:47:54] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.085 second response time [13:47:58] well you are around just before swat to announce it :] [13:48:04] RECOVERY - pdfrender on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.084 second response time [13:48:05] that is good enough as far as I am concerned [13:48:26] Hi guys [13:48:51] I am hesitating between "that is cool, migration done so fast" and " what is going to magically break :D" [13:48:59] hi arseny92 [13:49:00] hashar: can we use them for scap today? [13:49:06] <_joe_> zeljkof: yes [13:49:12] zeljkof: yeah should be good [13:49:21] because puppet :] [13:49:26] <_joe_> zeljkof: I'll send an email in a couple of minutes [13:49:36] and _joe_ knows about scap deployment so it should be all fine [13:49:38] _joe_: browser extensions are updated? [13:50:00] zeljkof: _joe_ I am about to go and make a PR for the extensions :) [13:50:24] hashar is swatting? hi addshore [13:50:25] we should get a mini service that exposes our servers [13:50:31] like mwappoid [13:50:38] <_joe_> zeljkof: nope, but they're mapped [13:50:41] that would server the debug machines over a RESTBase json entry point [13:50:53] _joe_: do mw2017 and mw2099 also have new names? [13:51:15] (or really get those test servers flagged in conftool somehow so extensions can grab the list from https://config-master.wikimedia.org/conftool/ ) [13:52:24] <_joe_> addshore: nope [13:52:40] <_joe_> hashar: we've mostly done it for chrome already [13:54:34] (03CR) 10Hashar: [C: 032] Enable RevisionSlider (non BetaFeature) on de,ar,hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319541 (https://phabricator.wikimedia.org/T148646) (owner: 10Addshore) [13:54:55] :( [13:54:59] 502 Bad Gateway [13:55:01] from nginx [13:55:09] dont we have aliases? 
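The alias behaviour _joe_ describes above (selecting mw1017 reaches mwdebug1001, mw1099 reaches mwdebug1002, while mw2017 and mw2099 keep their names) amounts to a name lookup. A minimal Python sketch of that idea, not the actual debug_proxy configuration: `DEBUG_ALIASES` and `resolve_debug_host` are hypothetical names, and only the two host pairs come from the conversation.

```python
# Hypothetical illustration only; the real routing is done by debug_proxy,
# whose configuration is not shown in this log.
DEBUG_ALIASES = {
    "mw1017": "mwdebug1001",
    "mw1099": "mwdebug1002",
}


def resolve_debug_host(selected: str) -> str:
    """Return the backend for a selected debug host, following old-name aliases.

    Accepts a short name ("mw1099") or a full name ("mw1099.eqiad.wmnet");
    any domain suffix is kept unchanged.
    """
    short, dot, domain = selected.partition(".")
    target = DEBUG_ALIASES.get(short, short)  # unknown names pass through as-is
    return f"{target}{dot}{domain}"
```

With this sketch, `resolve_debug_host("mw1099.eqiad.wmnet")` gives the mwdebug1002 name, and a host without an alias, such as mw2017, is returned unchanged.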
[13:55:14] (03Merged) 10jenkins-bot: Enable RevisionSlider (non BetaFeature) on de,ar,hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319541 (https://phabricator.wikimedia.org/T148646) (owner: 10Addshore) [13:56:48] <_joe_> hashar: uhm [13:56:59] <_joe_> hashar: which are you testing atm [13:57:02] I am using the firefox extension [13:57:07] using mw1099.eqiad.wmnet [13:57:14] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [13:57:15] I also get a 502 there. [13:57:29] <_joe_> yeah let me check [13:58:28] <_joe_> hah, the usual issue [13:58:35] <_joe_> it will be fixed in 30 seconds [13:58:47] working for me now :) [13:59:14] <_joe_> basically we ran puppet on one host (eqiad) but not the active one (codfw) for debug_proxy [13:59:20] the extension has a dropdown list to select the server and has mw1017 [13:59:40] <_joe_> arseny92: if you choose that, it will be routed to mwdebug1001 [13:59:44] or isthat also decom? [13:59:55] hashar: is the patch on mw1009 / mwdebug1002? [14:00:06] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161122T1400). Please do the needful. [14:00:06] Addshore: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[14:00:12] if so then the extension needs an update:) [14:00:14] patch is on mwdebug1001 [14:00:18] checking [14:00:19] and mwdebug1002 [14:00:34] RECOVERY - mediawiki-installation DSH group on mwdebug1001 is OK: OK [14:00:47] hashar: looks good, feel free to roll it out [14:01:52] arseny92: I have just made PRs for all of the browser extensions https://github.com/wikimedia/ChromeWikimediaDebug/pull/6 https://github.com/wikimedia/FirefoxWikimediaDebug/pull/19 https://github.com/paladox/EdgeWikimediaDebug/pull/1 [14:02:28] hashar: you are doing swat today? [14:02:36] addshore: I cant see the revision slider on eg https://de.wikipedia.org/w/index.php?title=Benutzer:Hashar&action=history [14:02:55] hashar: it is no on history pages, but on diff pages [14:03:01] OH MAN [14:03:10] IT IS KAPUT UX!! [14:03:19] =o? [14:03:28] I love that feature really [14:03:51] I clearly remember the gadget/javascript script that Zak Greant wrote like 6 years ago [14:04:06] it was adding a small sparkline at the top of articles / history page [14:04:25] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable RevisionSlider (non BetaFeature) on de,ar,hewiki - T149995 T148646 T150573 (duration: 00m 54s) [14:04:26] make the diffs easier to review [14:04:43] _joe_: all done thx :] [14:04:44] Yeh, its one of my favourite things over the past 12 months"! [14:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:48] T148646: Enable RevisionSlider (non betafeature) for all users on dewiki - https://phabricator.wikimedia.org/T148646 [14:04:49] T150573: Enable RevisionSlider (non betafeature) on hewiki - https://phabricator.wikimedia.org/T150573 [14:04:49] T149995: RevisionSlider release on Arabic Wikipeda - https://phabricator.wikimedia.org/T149995 [14:06:01] _joe_: mostly uneventful! guess you can send the announcement email [14:06:06] thanks for that! 
[14:07:09] jouncebot: next [14:07:09] In 2 hour(s) and 52 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161122T1700) [14:07:21] zeljkof: swat all done yeah [14:07:27] hashar: great [14:07:28] !log European SWAT completed [14:07:31] that was quick [14:07:43] a single patch, even thought addshore would have deployed it by himself but he was in some meeting [14:07:45] annnnd [14:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:51] we have new mw app debug servers \o/ [14:10:38] hashar, _joe_: could you also please update the docs? https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Test_Canary [14:10:51] (since mw1099 is gone) [14:12:01] hashar: zeljkof I may have 1 more patch for this window [14:12:20] hashar: around for more swat' [14:12:22] ? [14:13:15] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert mw1170, mw1171 to their former roles [puppet] - 10https://gerrit.wikimedia.org/r/322877 (owner: 10Alexandros Kosiaris) [14:14:04] (03PS2) 10Alexandros Kosiaris: Revert mw1170, mw1171 to their former roles [puppet] - 10https://gerrit.wikimedia.org/r/322877 [14:14:12] (03CR) 10Alexandros Kosiaris: [V: 032] Revert mw1170, mw1171 to their former roles [puppet] - 10https://gerrit.wikimedia.org/r/322877 (owner: 10Alexandros Kosiaris) [14:14:17] (03PS5) 10Addshore: Add missing $wgPropertySuggesterClassifyingPropertyIds for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 (owner: 10Thiemo Mättig (WMDE)) [14:14:20] (03CR) 10Addshore: [C: 031] Add missing $wgPropertySuggesterClassifyingPropertyIds for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 (owner: 10Thiemo Mättig (WMDE)) [14:14:40] !log Deploy ALTER table db1040 (master) commonswiki.revision - https://phabricator.wikimedia.org/T147305 [14:14:50] zeljkof: hashar ^^ that one [14:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:18] ^ 
disregard my previous message, I will do it tomorrow morning instead [14:15:38] 06Operations, 06Labs: labstore systemd state Icinga alarms - https://phabricator.wikimedia.org/T151322#2814238 (10Volans) [14:16:50] addshore: looks like hashar is not around, I can deploy it [14:16:56] addshore: is it in the calendar? [14:17:05] zeljkof: I have just added it to the calendar! :) [14:17:17] addshore: ok, on it [14:17:27] !log EU SWAT continues [14:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:44] sorry got distracted [14:19:33] hashar: want to do the swat? [14:19:50] can [14:20:05] hashar: ok, great, please do, I'm in a mil [14:20:08] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1170.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=appserver', 'service=apache2']) [14:20:10] middle of something else [14:20:12] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1171.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=appserver', 'service=apache2']) [14:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:30] (03CR) 10Hashar: [C: 032] Add missing $wgPropertySuggesterClassifyingPropertyIds for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 (owner: 10Thiemo Mättig (WMDE)) [14:20:32] addshore: hashar will do the swat [14:20:36] great!
[14:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:05] (03Merged) 10jenkins-bot: Add missing $wgPropertySuggesterClassifyingPropertyIds for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 (owner: 10Thiemo Mättig (WMDE)) [14:22:15] !log hashar@tin Started scap: Add missing $wgPropertySuggesterClassifyingPropertyIds for beta [14:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:24] 06Operations, 06Labs: labstore systemd state Icinga alarms - https://phabricator.wikimedia.org/T151322#2814266 (10chasemp) p:05High>03Normal a:03madhuvishy labstore1002 is waiting for reimage, and labstore2001 is not in service atm but I'm not sure why in that case. But it can wait for @madhuvishy to re... [14:23:39] hi. okay if i add some stuff to swat? i know i'm kind of late [14:24:45] ACKNOWLEDGEMENT - Check systemd state on labstore1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Volans Handled in https://phabricator.wikimedia.org/T151322 [14:24:58] ACKNOWLEDGEMENT - Check systemd state on labstore2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Volans Handled in https://phabricator.wikimedia.org/T151322 [14:25:19] 06Operations, 06Labs: labstore systemd state Icinga alarms - https://phabricator.wikimedia.org/T151322#2814272 (10Volans) Ok, I've ACK'ed the two critical in Icinga with this ticket. [14:26:40] (03CR) 10Ottomata: [C: 031] udp2log: prevent Ganglia install when it is not used [puppet] - 10https://gerrit.wikimedia.org/r/322639 (https://phabricator.wikimedia.org/T151169) (owner: 10Hashar) [14:26:46] (added https://gerrit.wikimedia.org/r/#/c/322892/ to swat) [14:27:27] ottomata: hello! I wasnt / not sure who is the owner of udp2log :/ [14:27:52] hashar: not me anymore! 
but i'm probably most familiar with puppet [14:28:00] the only prod use of udp2log is fluorine mw logs [14:28:08] mobrovac, _joe_: what's with the scb* alerts? [14:28:18] (03PS2) 10Ema: Remove varnish::apt_preferences [puppet] - 10https://gerrit.wikimedia.org/r/322703 (https://phabricator.wikimedia.org/T150660) [14:28:22] (03CR) 10Ema: [C: 032 V: 032] Remove varnish::apt_preferences [puppet] - 10https://gerrit.wikimedia.org/r/322703 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [14:28:37] 06Operations, 06Labs: create-dbusers service failing on labstore1004 - https://phabricator.wikimedia.org/T151310#2814276 (10chasemp) I believe this may happen periodically as the script makes a connection to both ldap servers (to round robin queries) to determine if new users exist and those servers currently... [14:29:17] anyway, i'm compiling that change for fluorine . no changes, so i will merge [14:29:24] (03CR) 10Ottomata: [C: 032] udp2log: prevent Ganglia install when it is not used [puppet] - 10https://gerrit.wikimedia.org/r/322639 (https://phabricator.wikimedia.org/T151169) (owner: 10Hashar) [14:29:35] (03PS2) 10Ottomata: udp2log: prevent Ganglia install when it is not used [puppet] - 10https://gerrit.wikimedia.org/r/322639 (https://phabricator.wikimedia.org/T151169) (owner: 10Hashar) [14:29:40] neat ! [14:30:12] (03CR) 10Ottomata: [V: 032] udp2log: prevent Ganglia install when it is not used [puppet] - 10https://gerrit.wikimedia.org/r/322639 (https://phabricator.wikimedia.org/T151169) (owner: 10Hashar) [14:30:53] ottomata: scb fails are you? [14:31:15] what kinda fails!? 
[14:31:26] eventstreams isn't showing up in pybal yet, apparently i have to restart pybal to do that [14:31:31] didn't realize that yesterday [14:31:37] but i wasn't aware of any generic scb fails [14:31:47] addshore: somehow scap sync takes a while :/ [14:31:48] !log hashar@tin scap aborted: Add missing $wgPropertySuggesterClassifyingPropertyIds for beta (duration: 09m 32s) [14:31:48] looking [14:32:03] <_joe_> !log upgraded firejail on all scb nodes in codfw [14:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:10] hashar: you can do it file by file? [14:32:11] hm, the check systemd state? [14:32:14] RECOVERY - mediawiki-installation DSH group on mwdebug1002 is OK: OK [14:32:18] dunno what that is, will check it, is surely related [14:32:27] ottomata: we have automatic monitoring of all deployed services on scb [14:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:29] <_joe_> ottomata: it is [14:32:39] hashar: Wikibase-labs.php Wikibase-production.php Wikibase.php [14:32:45] addshore: yes [14:32:46] doing so [14:32:48] ya mobrovac eventstreams is there now. the service is failing because apparently pybal hadn't picked up the change [14:32:54] !log hashar@tin Synchronized wmf-config: Add missing $wgPropertySuggesterClassifyingPropertyIds for beta (duration: 00m 56s) [14:32:55] didn't know about that part until i had to leave yesterday [14:33:07] also, [14:33:09] i added this [14:33:11] https://gerrit.wikimedia.org/r/#/c/322732/ [14:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:13] ottomata: so also pdfrender.service is you?
[14:33:24] volans: no, pdfrender i aint no nuthin [14:33:24] !log disabling puppet on caches ahead of unified cert update [14:33:26] :) [14:33:33] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2814282 (10hashar) [14:33:33] volans: that's _joe_ and me [14:33:36] paravoid: if you have a sec, +1 on this? paravoid, i think those are me [14:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:39] oops [14:33:40] haha [14:33:43] on this: https://gerrit.wikimedia.org/r/#/c/322732/ [14:33:50] (03PS2) 10BBlack: add planet and wmfusercontent to unified SAN checks [puppet] - 10https://gerrit.wikimedia.org/r/322697 [14:33:52] (03PS2) 10BBlack: remove unused r::c::ssl::misc [puppet] - 10https://gerrit.wikimedia.org/r/322696 [14:33:54] (03PS2) 10BBlack: cache_misc - switch to unified cert only [puppet] - 10https://gerrit.wikimedia.org/r/322695 [14:33:56] (03PS2) 10BBlack: caches: switch to new active unified TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/322694 [14:34:04] i want to merge that, and restart pybals in a bit (gotta help move a giant woodstove real quick though...afk for a bit) [14:34:22] (03CR) 10BBlack: [C: 032 V: 032] caches: switch to new active unified TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/322694 (owner: 10BBlack) [14:35:10] ottomata: merged your pending change [14:35:31] udp2log: prevent Ganglia install when it is not used (8090a3f) [14:35:40] hashar: i just added one more patch to the ongoing swat, if that's okay with you? [14:36:13] _joe_: should we ack the systemd failed check on scb100x until next week? [14:36:24] oh sorry bblack, thanks [14:36:27] <_joe_> mobrovac: it's caused by eventstreams [14:36:27] MatmaRex: which gerrit change please? 
[14:36:29] <_joe_> not by us [14:36:35] will fix today [14:36:46] !log deployed new unified certs to cache_maps [14:36:50] hashar: https://gerrit.wikimedia.org/r/#/c/322892/ [14:36:58] s/ed/ing/ but close enough heh [14:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:10] volans: where do you see a problem with pdfrender? [14:37:32] MatmaRex: will do :] [14:37:43] thanks [14:37:45] thanks for your work hashar :) [14:37:49] getting a coffee [14:37:58] mobrovac: pdfrender.service failed, icinga alarm is systemd state [14:38:02] be back to sync deploy that UploadWizard patch [14:39:00] _joe_: so we should ack that then ^ [14:39:13] or after ottomata fixes eventstreams [14:39:29] <_joe_> volans: uhm let me see [14:39:37] but wait, pdfrender should be stopped there [14:39:42] wth? [14:40:11] 06Operations, 10Wikimedia-General-or-Unknown: Class 'Memcached' not found when running mwscript eval.php on mw1017, mw1099 - https://phabricator.wikimedia.org/T150912#2814301 (10Dereckson) 05Open>03Invalid As the server has been decom, this task isn't interesting anymore. [14:40:25] (03CR) 10Ema: [C: 031] varnish: remove chash director leftovers [puppet] - 10https://gerrit.wikimedia.org/r/322885 (https://phabricator.wikimedia.org/T150660) (owner: 10BBlack) [14:40:29] <_joe_> mobrovac: it's still marked as failed from its last run [14:40:50] and it hasn't run since? [14:40:55] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [14:40:57] haha [14:40:58] ok [14:41:04] <_joe_> nope, because puppet has not started it again [14:41:12] TZ=UGT Good morning. [14:41:54] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [14:42:01] !log deploying new unified certs to cache_upload + cache_text [14:42:07] <_joe_> mobrovac: systemctl reset-failed is the solution [14:42:14] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [14:42:14] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [14:42:17] TZ=C git log --date=local [14:42:18] !! [14:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:31] MatmaRex: rolling the UW change :] [14:42:41] _joe_: nice tip! thnx [14:42:42] I guess something get fixed as a result [14:42:43] !log hashar@tin Synchronized php-1.29.0-wmf.3/extensions/UploadWizard/resources/mw.UploadWizardLicenseInput.js: mw.UploadWizardLicenseInput: Correct unguarded for...in - T151220 (duration: 00m 49s) [14:42:44] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [14:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:03] T151220: TypeError: undefined is not an object (evaluating title.replace) between Deed and Details steps - https://phabricator.wikimedia.org/T151220 [14:44:03] (03CR) 10BBlack: [C: 032] cache_misc - switch to unified cert only [puppet] - 10https://gerrit.wikimedia.org/r/322695 (owner: 10BBlack) [14:45:39] 06Operations: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#2814310 (10Dzahn) When i wrote this task i assumed we'd never have the role classes in autoload layout, but now we do, since manifests/role/ moved to modules/role/manifests/. 
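The pdfrender exchange above (service intentionally stopped, yet "Check systemd state" stayed CRITICAL until `systemctl reset-failed`) rests on one point: a unit that failed stays marked as failed until it is started again or the failure is explicitly reset, and any failed unit makes the overall state "degraded". A toy Python model of that reasoning, under the assumption that this is all that was going on; it is not systemd's implementation, and the class and method names are made up for illustration.

```python
class ToyUnitManager:
    """Toy model of per-unit state as described in the conversation."""

    def __init__(self):
        # unit name -> "active" | "inactive" | "failed"
        self.states = {}

    def start(self, unit, ok=True):
        # A start attempt either succeeds or leaves the unit marked as failed.
        self.states[unit] = "active" if ok else "failed"

    def stop(self, unit):
        # In this toy model, stopping only affects running units, so a
        # remembered failure is left in place.
        if self.states.get(unit) == "active":
            self.states[unit] = "inactive"

    def reset_failed(self, unit=None):
        # Loosely mirrors `systemctl reset-failed [UNIT]`: forget the failure
        # for one unit, or for all units when no name is given.
        for name, state in self.states.items():
            if state == "failed" and unit in (None, name):
                self.states[name] = "inactive"

    def system_state(self):
        # Any failed unit makes the whole system report "degraded".
        return "degraded" if "failed" in self.states.values() else "running"
```

In this model, a failed start of pdfrender.service followed by a stop still reports "degraded"; only after `reset_failed` does the state go back to "running", which matches the order of the RECOVERY lines above.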
[14:46:17] !log deploying new unified certs to cache_misc [14:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:50] hashar: thanks, works as expected [14:48:56] (03CR) 10Ema: [C: 031] "Confirmed noop https://puppet-compiler.wmflabs.org/4633/" [puppet] - 10https://gerrit.wikimedia.org/r/322884 (https://phabricator.wikimedia.org/T150660) (owner: 10BBlack) [14:49:44] (03CR) 10BBlack: [C: 032] remove unused r::c::ssl::misc [puppet] - 10https://gerrit.wikimedia.org/r/322696 (owner: 10BBlack) [14:49:52] (03CR) 10BBlack: [C: 032] add planet and wmfusercontent to unified SAN checks [puppet] - 10https://gerrit.wikimedia.org/r/322697 (owner: 10BBlack) [14:50:13] MatmaRex: :] [14:51:45] 06Operations: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#2814329 (10Dzahn) ..but we still can't remove this exception mostly because mariadb, openstack/nova and eventlogging have multiple classes in a single file and would have to be spl... [14:52:56] 06Operations: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#2814332 (10Dzahn) which i tried for example here but there was resistance to doing that (https://gerrit.wikimedia.org/r/#/c/315343/) [14:54:27] 06Operations, 10ops-eqiad: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964#2802840 (10Eevans) Just so that I can plan accordingly, is there an ETA on this? [14:58:04] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2814343 (10Dzahn) the following are what is left that would have to be fixed to finally remove the exception that makes us skip the autoloader check:... 
[14:58:28] (03PS1) 10Eevans: enable instance restbase2011-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/322896 (https://phabricator.wikimedia.org/T151086) [14:59:38] 06Operations: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#2814349 (10hashar) Yeah I noticed that. Then T119042 was to migrate everything out of manifests/role so I guess it is now easy to make all the modules to respect the autoloader layout... [14:59:50] (03CR) 10Eevans: [C: 031] "Ready to go." [puppet] - 10https://gerrit.wikimedia.org/r/322896 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [14:59:51] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search, 10Elasticsearch: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324#2814352 (10Gehel) [14:59:53] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2814365 (10Dzahn) also, there is --no-140chars-check now and we also trigger that with some lines and don't disable it in config [15:00:00] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search, and 2 others: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324#2814366 (10Gehel) [15:01:35] 06Operations: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#2814370 (10Dzahn) Yes, but my last comment was about how it's unfortunately not easy even though that migration out of manifests/role happened, i could not get those things merged. 
[15:01:56] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2814371 (10hashar) For autoloader_layout-check there is only 51 warnings: ``` $ bundle exec rake puppetlint|wc -l 51 ``` ``` modules/interface/manif... [15:04:07] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search, 10Elasticsearch: Upgrade to Java 8 for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151325#2814372 (10Gehel) [15:05:17] 06Operations: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#2814389 (10Dzahn) Yes, that is exactly what i said, pretty much all of that depends on mariadb, the other 2 are openstack and eventlogging. i once had patches for all of them... [15:05:49] mutante: sorry for the exact rephrasing :( [15:05:59] guess I am confused / failed to properly parse what you said! [15:06:31] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2814395 (10chasemp) [15:06:35] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Add config option in tools webservice debian package to write logs to /dev/null - https://phabricator.wikimedia.org/T149946#2814392 (10chasemp) 05Open>03Resolved a:03chasemp We reverted everything yesterday afternoon and so far no incident. [15:06:38] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2814396 (10Dzahn) That's what i just pasted, yea, looks "easy" but in reality it's not, see comments on https://gerrit.wikimedia.org/r/#/c/315343/ fo... 
[15:06:41] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2652289 (10chasemp) >>! In T146154#2799384, @chasemp wrote: > On second thought this should remain open until {T149946} is done (a... [15:11:14] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:11:28] so what's with the eventstreams alerts then? [15:11:32] mobrovac, ottomata ^ [15:11:42] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search, 10Elasticsearch: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#2814411 (10Gehel) [15:14:10] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search, 10Elasticsearch: move data to /srv for the cirrus / elasticsearch clusters - https://phabricator.wikimedia.org/T151328#2814443 (10Gehel) [15:14:18] paravoid: eventstreams is unhappy because apparently i have to restart pybal for it to pick up the lvs changes [15:14:26] didn't know that yesterday [15:14:35] sorry, had to move a giant woodstove, will fix asap [15:15:46] 06Operations: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#768414 (10Ottomata) I can refactor the eventlogging stuff if you need [15:19:05] !log scb in codfw restarting all services to pick up the new firejail [15:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:58] (03CR) 10Ottomata: [C: 032] Allow lvs service monitoring to specify critical parameter for monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/322732 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [15:21:09] (03PS4) 10Ottomata: Allow lvs service monitoring to specify critical parameter for monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/322732 (https://phabricator.wikimedia.org/T143925) 
[15:21:14] (03CR) 10Ottomata: [V: 032] Allow lvs service monitoring to specify critical parameter for monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/322732 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [15:21:46] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2814495 (10chasemp) [15:21:48] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2814494 (10chasemp) 05Open>03Resolved [15:21:57] why is eventstreams lvs non-critical now? :) [15:23:17] paravoid: its brand new, and i want to do some tests with real data and lots of consumers first, we'll also be doing a few deploys/config changes in the next few weeks. the plan is to announce it as a real thing with an rcstream deprecation plan next quarter [15:23:19] i want it to alert [15:23:22] but i don't want it to page folks [15:23:26] yet [15:23:44] I actually was thinking of talking to you (and maybe nuria?) about rcstream and all that [15:23:49] ja please [15:23:50] I should do so in an email :) [15:23:55] (03CR) 10Rush: "Let's do this first week back from Thanksgiving?" [puppet] - 10https://gerrit.wikimedia.org/r/322270 (https://phabricator.wikimedia.org/T133911) (owner: 10Hashar) [15:24:00] k [15:25:21] (03PS1) 10Alexandros Kosiaris: Test the future parser in puppet compiler [puppet] - 10https://gerrit.wikimedia.org/r/322898 [15:26:38] paravoid: know of any docs about rolling restart of pybal to pick up service changes? 
[15:27:21] no, but restarting each of the two in the pair spaced apart should suffice [15:28:08] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2814516 (10mobrovac) [15:28:10] 06Operations, 10Electron-PDFs, 07Service-deployment-requests, 06Services (doing), and 2 others: New service request - PDF Render - https://phabricator.wikimedia.org/T143129#2814513 (10mobrovac) 05Open>03stalled The service has been deployed on SCB, but is not active yet because its functioning depends... [15:30:46] (03PS1) 10DCausse: [cirrus] Increase interwiki loadtest to 75% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322900 (https://phabricator.wikimedia.org/T149740) [15:30:59] paravoid: each of the two in the pair? sorry, am looking in wikitech and puppet and not finding [15:31:25] what is it that you want to do exactly? [15:32:14] paravoid: i want to make eventstreams.svc work [15:32:18] it resolves to the proper IP [15:32:26] eventstreams.svc.eqiad.wmnet, I assume? [15:32:29] yes [15:32:37] it it just on eqiad right now? [15:32:38] all the proper config is out there [15:32:39] yes [15:32:46] ok, so it's assigned to a pair of LVS servers [15:33:02] lvs100M and lvs100N, depending on which class you put it [15:33:08] since it's an internal one [15:33:15] I guess it's 1005/1006? 
[15:33:28] 3+6 [15:33:31] (03PS1) 10Faidon Liambotis: mirrors: increase check_apt_mirror thresholds [puppet] - 10https://gerrit.wikimedia.org/r/322902 [15:33:32] er, right [15:33:58] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mirrors: increase check_apt_mirror thresholds [puppet] - 10https://gerrit.wikimedia.org/r/322902 (owner: 10Faidon Liambotis) [15:34:15] so restart pybal on lvs1006 first [15:34:21] wait a few minutes, then do lvs1003 [15:34:33] also, the main reason not to expend effort on documenting this is that: it's this way because it was built in an era where new service deploys were uncommon, and in a world where they're not, we really should have pybal able to load up new service configs on the fly on its own, but that hasn't happened yet. [15:34:58] maybe doesn't even have a task, but we have other backlogged pybal work that probably comes before it regardless [15:35:39] aye ok, wait how did you find 1006 and 1003? [15:35:41] low traffix, so [15:35:43] traffix [15:35:45] ah [15:35:46] traffic* [15:35:47] :) [15:36:09] modules/lvs/manifests/configuration.pp [15:36:19] AHHH [15:36:22] there it is! [15:36:22] ok [15:36:30] you should do lvs1009/lvs1012 too, that's kind of special [15:36:30] not 1009 or 1012 though? [15:36:33] ok [15:36:40] these are the 1003/1006 replacements, they're not online yet [15:37:13] ok [15:37:50] silly q [15:37:53] !bouncing pybal on lvs1006 and then lvs1003 to pick up changes for eventstreams.svc.eqiad.wmnet [15:37:54] ja? [15:37:55] is evenstreams == kasocki? [15:37:58] no [15:38:02] but yes :) [15:38:06] kasocki was the socket.io version [15:38:08] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151295#2813405 (10Volans) I've put both of them in scheduled downtime, the notifications were already disabled [15:38:11] and eventstreams is SSE? 
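The restart order described above (pybal on lvs1006 first, a wait of a few minutes, then lvs1003) can be sketched roughly as follows. This is only an illustration under assumptions: the log does not give the actual restart command, so `restart` is a hypothetical callable supplied by the caller, and the 180-second default pause is a guess at "a few minutes".

```python
import time
from typing import Callable, List, Sequence


def rolling_restart(
    hosts: Sequence[str],
    restart: Callable[[str], None],
    pause_seconds: float = 180.0,  # assumption: "a few minutes" taken as three
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart hosts one at a time, pausing between consecutive hosts.

    `restart` is a hypothetical hook: whatever actually restarts pybal on
    a host (the log does not say how that is done).
    """
    restarted: List[str] = []
    for index, host in enumerate(hosts):
        if index > 0:
            # Space the restarts apart so both LVS servers of the pair
            # are never restarting at the same moment.
            sleep(pause_seconds)
        restart(host)
        restarted.append(host)
    return restarted


if __name__ == "__main__":
    # Dry run: print the order instead of touching any server.
    rolling_restart(["lvs1006", "lvs1003"], restart=print, pause_seconds=0)
```

Passing the host list in the order given in the channel keeps the sequencing decision with the operator; the function only enforces the one-at-a-time spacing.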
[15:38:14] (03PS1) 10Marostegui: site.pp: db1052's binlog changed to ROW [puppet] - 10https://gerrit.wikimedia.org/r/322903 (https://phabricator.wikimedia.org/T150960) [15:38:26] so there's not going to be a kasocki at all then [15:38:26] eventstreams is hte service template wrapper that uses kafka-sse (which is the kafka SSE library) [15:38:27] yes [15:38:33] the king is dead long live the king, ok :) [15:38:39] i mean, there was always going to be an 'eventstreams' [15:38:52] what it used for the implementation was going to be kasocki, but then we switched to kafka-sse [15:39:04] but, i am sad about kasocki, i liked it! [15:39:05] oh well [15:39:06] :) [15:39:07] http://github.com/wikimedia/kafkasse [15:39:09] ok [15:39:15] and is it going to be fronted by varnish? [15:39:34] yes [15:39:44] talked with bblack about this, its going to be varnish misc from eventstreams.wm.org [15:39:50] ok [15:40:02] it really should be stream.wm.org :) [15:40:04] gabriel really wanted it in rest.wm.org in hierarchy somehwere, but that is hard because that would have to go through texts [15:40:09] also, salt knows G@lvs_class:low-traffic and G@lvs:primary (or secondary) [15:40:12] haha, bblack told us to change hte name! 
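For readers unfamiliar with SSE (server-sent events), which the exchange above says eventstreams is built on via kafka-sse, here is a minimal sketch of how a client might parse the event-stream text format. It follows the generic SSE field names (`event`, `data`, `id`) and is not taken from the eventstreams or kafka-sse code; the sample payload in the usage lines is made up for illustration.

```python
from typing import Dict, Iterable, List, Optional


def parse_sse(lines: Iterable[str]) -> List[Dict[str, Optional[str]]]:
    """Parse text/event-stream lines into a list of events.

    Handles only the `event`, `data` and `id` fields and comment lines;
    other fields (such as `retry`) are ignored in this sketch.
    """
    events: List[Dict[str, Optional[str]]] = []
    event_type = "message"  # default event type in SSE
    data: List[str] = []
    last_id: Optional[str] = None

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; dispatch only if it has data.
            if data:
                events.append(
                    {"event": event_type, "data": "\n".join(data), "id": last_id}
                )
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event_type = value
        elif field == "data":
            data.append(value)
        elif field == "id":
            last_id = value
    return events


if __name__ == "__main__":
    # Made-up sample stream, not real eventstreams output.
    sample = [": keep-alive\n", "id: 1\n", 'data: {"title": "Example"}\n', "\n"]
    print(parse_sse(sample))
```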
[15:40:39] paravoid: stream.wm.o is the existing legacy service, and has legacy issues with being exempted from HTTPS redirects with no timeline to fix it yet [15:40:54] I figured we should aim to deploy this alongside it, move users, remove old service [15:41:12] paravoid: i had considered doing streams.wm.org instead of evenetstreams.wm.org [15:41:17] RECOVERY - LVS HTTP IPv4 on eventstreams.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.020 second response time [15:41:22] i kinda like that better, but the opinions i had got so far was that could be consufing [15:41:24] confusing [15:42:04] the situation around irc/rcstream/eventstreams is confusing enough as it is [15:42:21] it's confusing even for us, let alone our users, and I don't think we should confuse our users even more than we already are :) [15:42:35] I'd do stream.wm.org/v2/ or something like that honestly [15:42:47] paravoid: stay tuned, got plans to document and communicate big time next quarter. yeah i'd prefer to keep stream.wm.org too, but bblack said we gotta change it [15:43:00] and what I also wanted to talk to you guys about is... 
[15:43:07] owning IRC and RCStream as well :) [15:43:38] I know I won't be popular for that, but it's been a mess with people for various places in the org owning different parts [15:43:38] well...haha, RCstream i'm happy to take, since i want to deprecate it it [15:43:44] yeah [15:43:46] let's talk for sure [15:44:05] and this is why we are about to have three different services for the same thing :) [15:44:23] the biggest problem technologically is that these stream services want pipes to the applayer [15:44:45] and there seems to be some push to expand what these streams services can do and what situations clients use them [15:45:02] (03CR) 10Hashar: "On hold since it can possible threaten the OpenStack infrastructure and doing that just before a 4 days break in US (Thanks giving) is not" [puppet] - 10https://gerrit.wikimedia.org/r/322270 (https://phabricator.wikimedia.org/T133911) (owner: 10Hashar) [15:45:11] I don't think we want to end up in a situation where a common anonymous readonly client's JS/serviceworkers are opening pipes through our edge stack into the applayer [15:45:20] that's the far scary edge of where it could try to go [15:45:54] (03PS2) 10Marostegui: site.pp: db1052's binlog changed to ROW [puppet] - 10https://gerrit.wikimedia.org/r/322903 (https://phabricator.wikimedia.org/T150960) [15:48:38] (03PS1) 10Marostegui: db-eqiad.php: Depool db1052 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322906 (https://phabricator.wikimedia.org/T150960) [15:49:36] (03PS3) 10Jcrespo: site.pp: db1052's binlog changed to ROW [puppet] - 10https://gerrit.wikimedia.org/r/322903 (https://phabricator.wikimedia.org/T150960) (owner: 10Marostegui) [15:50:11] (03PS1) 10Dzahn: puppet-lint: ignore lives over 140 chars warning [puppet] - 10https://gerrit.wikimedia.org/r/322907 [15:50:32] (03PS4) 10Jcrespo: site.pp: db1052's binlog changed to ROW [puppet] - 10https://gerrit.wikimedia.org/r/322903 (https://phabricator.wikimedia.org/T150960) (owner: 
10Marostegui) [15:51:01] (03CR) 10jenkins-bot: [V: 04-1] puppet-lint: ignore lives over 140 chars warning [puppet] - 10https://gerrit.wikimedia.org/r/322907 (owner: 10Dzahn) [15:51:19] 06Operations: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#2814620 (10Dzahn) Thanks Ottomata, that would be great. [15:51:24] 06Operations, 10ops-eqiad, 10DBA: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2814621 (10Cmjohnson) @jcrespo I re-seated all the components to the raid controller and powered on, all disks are now showing as 1 LD and booted to the OS You may want to do s... [15:52:03] 06Operations, 10ops-eqiad, 10DBA: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2814625 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson @jcrespo please re-open if problem persists. [15:52:46] (03PS1) 10Ema: varnish: remove error_synth [puppet] - 10https://gerrit.wikimedia.org/r/322908 (https://phabricator.wikimedia.org/T150660) [15:53:02] (03PS2) 10Dzahn: enable instance restbase2011-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/322896 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [15:53:14] hey, cmjohnson1 thank you very much [15:53:26] yw [15:53:33] with I am a bit lost in translation [15:53:51] reseated = checked the physical connection? [15:54:38] reseated=took all the cards out and rebuilt them and put them back in [15:54:50] oh, thank you [15:55:01] sorry for the extra work [15:55:14] no extra work...it's what I do! [15:55:19] :-) [15:55:30] So can someone clarify what's going on with the connection failure to all Wikimedia sites> [15:57:11] CP678|Laptop, I'm not aware of one, can you pastebin a traceroute? [15:57:15] CP678|Laptop: can you first clarify what connection failure to all sites you're talking about? 
[15:57:33] Can't establish a secure connection to Wikipedia [15:57:59] Krenair: how do I traceroute? [15:58:09] what operating system are you using? [15:58:12] mac [15:58:16] no idea [15:58:21] :| [15:58:24] Which "Wikipedia"? [15:58:30] ALl of them [15:58:58] are other major sites working for you? what actual error do you get? [15:59:00] bblack: ^ sounds like cert fallout? [15:59:06] ok, you're on it [15:59:07] All wikipedia/wikimedia domains fail [15:59:20] "fail" can happen a lot of different ways, we need details [15:59:20] Everything else works [15:59:22] I believe there must be a webpage explaining how to run a traceroute on Mac? [15:59:30] hm [16:00:20] maybe it's a cert issue, can you get to gerrit.wikimedia.org? (that uses a different cert) [16:00:33] Yes [16:00:41] Running the traceroute [16:00:44] but not meta.wikimedia.org [16:01:00] PROBLEM - mysqld processes on labsdb1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:01:12] The traceroute seems to have died [16:01:16] ^that is the just booted server [16:01:22] CP678|Laptop: what time and date are shown on your system, in the actual taskbar or whatever [16:01:46] 11:01 AM Today [16:01:55] timezone? [16:01:59] what date does it have for Today though? [16:02:00] RECOVERY - mysqld processes on labsdb1009 is OK: PROCS OK: 1 process with command name mysqld [16:02:19] (03CR) 10Marostegui: [C: 032] site.pp: db1052's binlog changed to ROW [puppet] - 10https://gerrit.wikimedia.org/r/322903 (https://phabricator.wikimedia.org/T150960) (owner: 10Marostegui) [16:02:35] EST [16:02:37] jynus: labsdb1009 is taht you guys? 
[16:02:38] I guess the time doesn't really matter [16:02:46] 11/22/16 [16:02:47] chasemp, he said it had just booted [16:02:50] http://pastebin.com/TQy9h448 [16:02:51] ah [16:02:54] thanks Krenair [16:02:54] Tracreoute ^ [16:03:36] chasemp, it had hw problems, chris just managed to boot it [16:03:38] paravoid: nothing after zayo ^ [16:03:55] but on the other hand he can get to gerrit [16:03:56] it doesn't look like a network issue to me [16:04:02] yeah I doubt it's a network issue either [16:04:15] "can't establish a secure connection" sounds like HTTPS issues [16:04:19] but why does the traceroute stop at zayo ? [16:04:21] (03CR) 10Ema: "Functional noop, empty newlines removed. https://puppet-compiler.wmflabs.org/4638/" [puppet] - 10https://gerrit.wikimedia.org/r/322908 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [16:04:30] UDP traceroute [16:04:40] chasemp, it should be ok now [16:04:46] where is can't establish a secure connection? [16:04:55] bblack: safari [16:05:16] ah yes. UDP traceroute indeed, heh [16:05:19] I mean I don't recall reading that above, I must have missed it in the noise [16:05:43] ah I see it now [16:06:08] CP678|Laptop: can you try another browser, like Firefox or Chrome, on the same machine? [16:06:45] It loads on Firefox [16:07:40] CP678|Laptop: ok. In Safari, is there any way to get more detail about the error? clicking on some button for "advanced" or "details", or clicking on the error mesage itself, or the https lock icon, etc... ? (Sorry I don't use Safari, and I have no idea how you get more detail out of it) [16:08:16] bblack: I've been trying to that since I started. :\ [16:08:20] (03CR) 10Jcrespo: [C: 031] "Monitor behaviour of the other api servers- I do not trust them." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/322906 (https://phabricator.wikimedia.org/T150960) (owner: 10Marostegui) [16:08:51] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1052 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322906 (https://phabricator.wikimedia.org/T150960) (owner: 10Marostegui) [16:08:58] Failed to load resource: An SSL error has occurred and a secure connection to the server cannot be made. [16:09:18] ok [16:09:25] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1052 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322906 (https://phabricator.wikimedia.org/T150960) (owner: 10Marostegui) [16:10:39] CP678|Laptop: can you try to follow the instructions at: https://support.globalsign.com/customer/portal/articles/1353318 [16:10:54] CP678|Laptop: note there are different instructions for MacOS Sierra than earlier ones [16:11:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1052 - T150960 (duration: 00m 49s) [16:11:40] Fixed [16:11:48] following those instructions fixed it? [16:11:48] :-) [16:11:53] Yes [16:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:54] T150960: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960 [16:12:05] CP678|Laptop: can we get the exact version of your Safari and MacOS, just for the record? [16:12:31] macOS Sierra v10.12.2 Beta (16C32f) [16:12:43] Safari Version 10.0.2 (12602.3.3) [16:12:54] thanks! [16:14:10] bblack: fyi that big banner for logged out users on the top of Wikipedia, is ugly. :p [16:14:39] (03CR) 10Dzahn: [C: 04-1] "this option will be introduced in puppet-lint 2, so first we want to get upgraded" [puppet] - 10https://gerrit.wikimedia.org/r/322907 (owner: 10Dzahn) [16:14:47] it is rather large isn't it? :) [16:14:55] Yea. 
:p [16:15:02] (03PS1) 10Ema: Remove Varnishkafka APT pinning [puppet] - 10https://gerrit.wikimedia.org/r/322911 (https://phabricator.wikimedia.org/T150660) [16:15:10] !log relocating dbprox1010/1011 to rack c5 [16:15:21] bblack: and you don't take Apple Pay? :O [16:15:21] it also gets smaller on the second pageview, but doesn't ever seem to go away on its own without user action. I don't remember if that's the same as last year. [16:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:42] don't ask me, I'm not in Fundraising. They just bring the funds for my salary :) [16:16:00] lol [16:16:26] (03CR) 10Dzahn: [C: 032] enable instance restbase2011-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/322896 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [16:16:31] (03PS3) 10Dzahn: enable instance restbase2011-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/322896 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [16:18:10] CP678|Laptop: We are working on Apple Pay. 
They only recently brought in apple pay for non-profits [16:22:41] (03PS1) 10BBlack: Revert "Revert "GlobalSign G2 intermediate, signed by R3"" [puppet] - 10https://gerrit.wikimedia.org/r/322913 (https://phabricator.wikimedia.org/T148045) [16:23:09] ACKNOWLEDGEMENT - eventstreams on scb1001 is CRITICAL: HTTP CRITICAL - No data received from host daniel_zahn https://phabricator.wikimedia.org/T148779 [16:23:09] ACKNOWLEDGEMENT - eventstreams on scb1002 is CRITICAL: HTTP CRITICAL - No data received from host daniel_zahn https://phabricator.wikimedia.org/T148779 [16:23:09] ACKNOWLEDGEMENT - eventstreams on scb1003 is CRITICAL: HTTP CRITICAL - No data received from host daniel_zahn https://phabricator.wikimedia.org/T148779 [16:23:09] ACKNOWLEDGEMENT - eventstreams on scb1004 is CRITICAL: HTTP CRITICAL - No data received from host daniel_zahn https://phabricator.wikimedia.org/T148779 [16:23:09] ACKNOWLEDGEMENT - eventstreams on scb2001 is CRITICAL: HTTP CRITICAL - No data received from host daniel_zahn https://phabricator.wikimedia.org/T148779 [16:23:09] ACKNOWLEDGEMENT - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T148779 [16:23:10] ACKNOWLEDGEMENT - eventstreams on scb2002 is CRITICAL: connect to address 10.192.48.43 and port 8092: Connection refused daniel_zahn https://phabricator.wikimedia.org/T148779 [16:23:10] ACKNOWLEDGEMENT - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://phabricator.wikimedia.org/T148779 [16:23:10] ACKNOWLEDGEMENT - eventstreams on scb2003 is CRITICAL: connect to address 10.192.0.33 and port 8092: Connection refused daniel_zahn https://phabricator.wikimedia.org/T148779 [16:23:11] ACKNOWLEDGEMENT - eventstreams on scb2004 is CRITICAL: HTTP CRITICAL - No data received from host daniel_zahn https://phabricator.wikimedia.org/T148779 [16:24:00] (03PS1) 10Andrew Bogott: wikistatus: Break out the page-editing code for re-use elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/322914 [16:28:00] (03CR) 10BBlack: [C: 032 V: 032] Revert "Revert "GlobalSign G2 intermediate, signed by R3"" [puppet] - 10https://gerrit.wikimedia.org/r/322913 (https://phabricator.wikimedia.org/T148045) (owner: 10BBlack) [16:28:54] PROBLEM - cassandra-b CQL 10.192.32.153:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.153 and port 9042: Connection refused [16:30:20] !log disabling puppet on caches to do post-merge fixup on chain certs for https://gerrit.wikimedia.org/r/#/c/322913/ [16:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:44] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:34:34] (03PS1) 10Cmjohnson: Relocated dbproxy1010 and 1011 to row C, changing dns to match new vlan [dns] - 10https://gerrit.wikimedia.org/r/322915 [16:34:44] (03PS1) 10Alexandros Kosiaris: Update Templates for 5.0.13 version [software/otrs] - 10https://gerrit.wikimedia.org/r/322916 (https://phabricator.wikimedia.org/T147331) [16:35:22] (03CR) 10Cmjohnson: [C: 032] Relocated dbproxy1010 and 1011 to row C, changing dns to match new vlan [dns] - 10https://gerrit.wikimedia.org/r/322915 (owner: 10Cmjohnson) [16:36:21] 06Operations, 10DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2814733 (10Volans) I've done a bit of cleanup, re-enabling some of them that were ok and leftover of other maintenance. `maps-test*` is being worked by @Gehel for a proper fix. All the others at th... [16:36:44] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:37:26] (03PS1) 10Giuseppe Lavagetto: Add LVS IP for pdfrender [dns] - 10https://gerrit.wikimedia.org/r/322917 [16:37:36] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Update Templates for 5.0.13 version [software/otrs] - 10https://gerrit.wikimedia.org/r/322916 (https://phabricator.wikimedia.org/T147331) (owner: 10Alexandros Kosiaris) [16:38:16] (03PS2) 10Andrew Bogott: wikistatus: Break out the page-editing code for re-use elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/322914 [16:39:48] 06Operations, 10ops-eqiad, 10DBA, 06Labs, and 3 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2814739 (10Cmjohnson) @jcrespo, these 2 servers have been moved to rack C5, connected to the acces... 
[16:39:55] (03CR) 10Andrew Bogott: [C: 032] wikistatus: Break out the page-editing code for re-use elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/322914 (owner: 10Andrew Bogott) [16:40:21] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.32.153:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.153 and port 9042: Connection refused eevans Bootstrapping [16:41:57] 06Operations, 10ops-eqiad, 10DBA, 06Labs, and 3 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2814740 (10jcrespo) a:03jcrespo @Cmjohnson Thank you a lot! I will take it from here [16:46:42] !log roll back to globalsign R3-based intermediate for unified complete and confirmed on all hosts [16:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:20] (03Abandoned) 10Andrew Bogott: Wikitech: Increase login throttle limits x4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322207 (https://phabricator.wikimedia.org/T150373) (owner: 10Andrew Bogott) [16:49:07] 06Operations, 10ops-codfw: rack/setup prometheus200[1-2] - https://phabricator.wikimedia.org/T151338#2814780 (10Reedy) [16:49:12] (03Abandoned) 10Andrew Bogott: Don't ask LDAP about instance puppet roles [puppet] - 10https://gerrit.wikimedia.org/r/316915 (owner: 10Andrew Bogott) [16:50:38] (03PS2) 10BBlack: varnish: un-template v[34] syntax-helper variables [puppet] - 10https://gerrit.wikimedia.org/r/322884 (https://phabricator.wikimedia.org/T150660) [16:50:59] (03CR) 10BBlack: [C: 032 V: 032] varnish: un-template v[34] syntax-helper variables [puppet] - 10https://gerrit.wikimedia.org/r/322884 (https://phabricator.wikimedia.org/T150660) (owner: 10BBlack) [16:51:09] (03PS2) 10BBlack: varnish: remove chash director leftovers [puppet] - 10https://gerrit.wikimedia.org/r/322885 (https://phabricator.wikimedia.org/T150660) [16:51:13] (03CR) 10BBlack: [C: 032 V: 032] varnish: remove chash director 
leftovers [puppet] - 10https://gerrit.wikimedia.org/r/322885 (https://phabricator.wikimedia.org/T150660) (owner: 10BBlack) [16:51:28] !log Testing safesubst: log message recording on wiki [16:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:00] that should be way faster [16:52:23] (03PS5) 10Andrew Bogott: Labs dns: Ensure the mysql server starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/318572 [16:52:44] (03PS6) 10Andrew Bogott: Labs dns: Ensure the mysql server starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/318572 [16:55:06] (03PS2) 10BBlack: varnish: remove error_synth [puppet] - 10https://gerrit.wikimedia.org/r/322908 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [16:55:29] (03CR) 10BBlack: [C: 032 V: 032] "manual rebase onto other cleanup, still works!" [puppet] - 10https://gerrit.wikimedia.org/r/322908 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [16:58:08] So the new testing servers are mwdebug1001 and mwdebug1002? Are there mwdebug2xxx hosts in codfw too? [16:59:11] bd808, as far as I know, the ones on codfw have not been youched [16:59:16] *touched [16:59:36] (03CR) 10Andrew Bogott: [C: 032] Labs dns: Ensure the mysql server starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/318572 (owner: 10Andrew Bogott) [16:59:40] (03PS7) 10Andrew Bogott: Labs dns: Ensure the mysql server starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/318572 [16:59:41] I guess they probably are still in warantee [16:59:55] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on a few images Mediawiki 500s on - https://phabricator.wikimedia.org/T150756#2814878 (10Gilles) We seem to be dealing mostly with giant images here. 1610_Douai_Old_Testament.pdf, seen in T150746, crops up again. Except this time it makes Mediawiki 5... 
[17:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161122T1700). Please do the needful. [17:01:48] no patches to be SWAT [17:01:58] (03PS1) 10Andrew Bogott: Revert "Labs dns: Ensure the mysql server starts at boot" [puppet] - 10https://gerrit.wikimedia.org/r/322921 [17:02:22] ^yes it has [17:02:41] now, running I do not think it is a valid parameter value [17:02:55] (03CR) 10Andrew Bogott: [C: 032] Revert "Labs dns: Ensure the mysql server starts at boot" [puppet] - 10https://gerrit.wikimedia.org/r/322921 (owner: 10Andrew Bogott) [17:03:33] (03PS1) 10Andrew Bogott: Labs dns: Ensure the mysql server starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/322923 [17:03:46] among other reasons, because analytics is using it already [17:04:14] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:14] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:06:47] or enabled [17:07:07] enabled is not the parameter you merged [17:07:19] enable, on the other side... [17:08:41] (03PS1) 10Ottomata: Fix for evenstreams icinga http lvs alert [puppet] - 10https://gerrit.wikimedia.org/r/322924 (https://phabricator.wikimedia.org/T143925) [17:08:55] 06Operations, 10ops-codfw: rack/setup prometheus200[1-2] - https://phabricator.wikimedia.org/T151338#2814893 (10fgiunchedi) @Papaul thanks! racking info looks good, only requirement is being in different rows, which it is already. re: partman I don't think there's a recipe ready, unless there's a particular pr... 
[17:09:13] ^andrewbogott see my comments [17:10:19] jynus: ok, I will catch up [17:10:41] (03CR) 10Ottomata: [C: 032] Fix for evenstreams icinga http lvs alert [puppet] - 10https://gerrit.wikimedia.org/r/322924 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [17:11:22] (03PS1) 10Giuseppe Lavagetto: pdfrender: lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/322925 [17:11:26] <_joe_> mobrovac: ^^ [17:15:23] RECOVERY - eventstreams on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.004 second response time [17:15:23] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.003 second response time [17:15:25] (03CR) 10Mobrovac: pdfrender: lvs configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322925 (owner: 10Giuseppe Lavagetto) [17:15:33] RECOVERY - eventstreams on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.003 second response time [17:15:42] (03PS1) 10Gehel: maps - maps-test* servers are test servers [puppet] - 10https://gerrit.wikimedia.org/r/322927 (https://phabricator.wikimedia.org/T149643) [17:16:45] (03CR) 10Gehel: "I'm actually not entirely sure that we want to completely remove Ops from the contact groups of those servers. They are already non paging" [puppet] - 10https://gerrit.wikimedia.org/r/322927 (https://phabricator.wikimedia.org/T149643) (owner: 10Gehel) [17:17:53] RECOVERY - eventstreams on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.002 second response time [17:18:21] gehel: if they're not paging for me is ok to have them as normal hosts, in particular for basic host checks [17:20:12] volans: I was making the comparison with our labs VMs in my head... [17:20:42] jynus: I see the enabled/enable thing [17:20:45] but for ensure… https://docs.puppet.com/puppet/latest/reference/types/service.html#service-attribute-ensure [17:20:49] looks like 'running' is valid? 
[17:21:08] for service yes [17:21:13] I said not sure, first [17:21:18] and for mariadb::service [17:21:37] which is a non-bundled class [17:21:52] ok [17:21:54] !log re-enabling alerts for maps-test* servers [17:22:01] that var is passed on to a normal puppet service, so 'running' should be fine [17:22:03] I'll try again :) [17:22:03] I do not boast thinking I can understand puppet very well [17:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:10] I probably don't [17:22:25] (03PS2) 10Andrew Bogott: Labs dns: Ensure the mysql server starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/322923 [17:22:26] so do not take my suggestions very seriously [17:22:33] but the submodule is merged [17:22:48] that part, I know because analytics use it [17:22:58] and beta I think, too [17:23:44] (03PS3) 10Andrew Bogott: Labs dns: Ensure the mysql server starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/322923 [17:25:01] regarding that mysql, I have enabled socket_auth there [17:25:14] that means no root password [17:25:32] it authenticates using the unix local user [17:26:00] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760#2814915 (10fgiunchedi) a:05Gilles>03fgiunchedi @Gilles indeed I've granted `mw:thumbor` user access only to thumb containers, looks like we'll need to d... 
[17:26:03] (03CR) 10Andrew Bogott: [C: 032] Labs dns: Ensure the mysql server starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/322923 (owner: 10Andrew Bogott) [17:27:08] 06Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2814919 (10fgiunchedi) [17:27:10] 06Operations, 13Patch-For-Review: Prometheus cronspam - https://phabricator.wikimedia.org/T151149#2814917 (10fgiunchedi) 05Open>03Resolved fixed \o/ [17:27:42] (03CR) 10Volans: [C: 04-1] "Much nicer but I think there is an error, see inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [17:31:12] !log performing schema change on db1078 (page) T69223 [17:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:23] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [17:34:33] did that create a pileup? [17:34:54] it seems it was only a temporary glitch [17:35:07] (03PS1) 10BryanDavis: debug.json: update eqiad debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322929 [17:35:50] !log performing schema change on db1075 (page) T69223 [17:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:08] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2814963 (10greg) [17:39:26] !log performing schema change on db1076 (enwiktionary.page) T69223 [17:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:37] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [17:41:24] (03PS1) 10Ottomata: Add eventstreams.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/322931 (https://phabricator.wikimedia.org/T143925) [17:41:47] (03PS2) 10Ottomata: Add eventstreams.svc.codfw.wmnet [dns] - 
10https://gerrit.wikimedia.org/r/322931 (https://phabricator.wikimedia.org/T143925) [17:42:32] (03CR) 10Ottomata: [C: 032] Add eventstreams.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/322931 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [17:43:43] (03CR) 10Dzahn: "< volans> actually mutante, is not possible to pass the env as a parameter?" [puppet] - 10https://gerrit.wikimedia.org/r/322781 (https://phabricator.wikimedia.org/T151148) (owner: 1020after4) [17:46:08] (03CR) 10Dzahn: "< volans> or via conf file" [puppet] - 10https://gerrit.wikimedia.org/r/322781 (https://phabricator.wikimedia.org/T151148) (owner: 1020after4) [17:46:22] !log swift eqiad-prod: ms-be1027 to weight 1000 T136631 [17:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:33] T136631: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631 [17:47:25] !log performing schema change on db1094 (metawiki.page) T69223 [17:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:36] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [17:47:43] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [17:47:53] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:48:23] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [17:50:19] 06Operations, 10DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2815029 (10Dzahn) re-enabled notifications on some install1001/2001 services [17:51:51] (03PS1) 10Ottomata: Remove trailing spaces in conftool-data/nodes/codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/322934 [17:51:53] (03PS1) 10Ottomata: Configure eventstreams in codfw backed by analytics-eqiad Kafka [puppet] - 10https://gerrit.wikimedia.org/r/322935 (https://phabricator.wikimedia.org/T143925) [17:53:04] 06Operations, 10ops-codfw: rack/setup prometheus200[1-2] - https://phabricator.wikimedia.org/T151338#2815057 (10Papaul) {F4823522} [17:53:14] (03PS2) 10Ottomata: Remove trailing spaces in conftool-data/nodes/codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/322934 [17:53:19] (03CR) 10Ottomata: [C: 032] Remove trailing spaces in conftool-data/nodes/codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/322934 (owner: 10Ottomata) [17:53:36] (03CR) 10Ottomata: [C: 032 V: 032] Remove trailing spaces in conftool-data/nodes/codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/322934 (owner: 10Ottomata) [17:53:52] (03CR) 10Dzahn: base/ipmi: install freeipmi globally, move to ipmi module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [17:53:56] (03PS2) 10Ottomata: Configure eventstreams in codfw backed by analytics-eqiad Kafka [puppet] - 10https://gerrit.wikimedia.org/r/322935 (https://phabricator.wikimedia.org/T143925) [17:54:22] (03PS10) 10Dzahn: base/ipmi: install freeipmi globally, move to ipmi module [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) [17:56:23] !log restbase deploy start of 9c7822d [17:56:27] (03PS1) 10Andrew Bogott: Add wikistatus 
utility class pageeditor.py [puppet] - 10https://gerrit.wikimedia.org/r/322937 [17:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:57] greg-g, are we ok to deploy some fixes to Kartotherian today, or do we have a week-long freeze for everything? [17:57:23] PROBLEM - Check whether ferm is active by checking the default input chain on es2019 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [17:57:39] RECOVERY - eventstreams on scb2004 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.102 second response time [17:57:51] did es2019 just crash? [17:57:59] RECOVERY - eventstreams on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.105 second response time [17:57:59] RECOVERY - eventstreams on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.086 second response time [17:58:09] PROBLEM - Check systemd state on es2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:58:18] jynus: uptime 25h [17:58:19] PROBLEM - NTP on es2019 is CRITICAL: NTP CRITICAL: Offset unknown [17:58:24] nope [17:58:27] (03CR) 10Andrew Bogott: [C: 032] Add wikistatus utility class pageeditor.py [puppet] - 10https://gerrit.wikimedia.org/r/322937 (owner: 10Andrew Bogott) [17:58:29] RECOVERY - eventstreams on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.102 second response time [17:58:31] then it is the ack expiring [17:58:52] manuel handles that, so I will just ack it again [18:00:21] mmm [18:00:29] but NTP failing? [18:00:41] jynus: icinga-wm could notify us when 10% of the downtime time is left :-P [18:00:42] that normally happens only on reboot [18:00:44] 06Operations, 10ops-codfw, 10netops: prometheus200[1-2] switch port configuration - https://phabricator.wikimedia.org/T151357#2815101 (10Papaul) [18:01:08] yurik: what are they/how important?
[18:01:35] 06Operations, 10ops-codfw: rack/setup prometheus200[1-2] - https://phabricator.wikimedia.org/T151338#2814695 (10Papaul) [18:01:35] All maps generated by users at the moment get incorrectly cached for the first 1 hour [18:01:44] greg-g, ^ [18:02:13] which means users don't see any custom drawings on top of the map for the first hour, just the base map itself [18:02:29] task please [18:03:54] (03PS1) 10Papaul: Add DNS entries for prometheus200[1-2] Bug:T151338 [dns] - 10https://gerrit.wikimedia.org/r/322938 (https://phabricator.wikimedia.org/T151338) [18:04:19] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:19] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [18:06:23] greg-g, https://phabricator.wikimedia.org/T150358 [18:06:28] see comments at the end [18:06:51] greg-g, it will be service only, not MW [18:07:03] (03CR) 10Dzahn: [C: 032] "confirmed with racktables, mgmt interfaces already up" [dns] - 10https://gerrit.wikimedia.org/r/322938 (https://phabricator.wikimedia.org/T151338) (owner: 10Papaul) [18:07:19] (03PS2) 10Dzahn: Add DNS entries for prometheus200[1-2] Bug:T151338 [dns] - 10https://gerrit.wikimedia.org/r/322938 (https://phabricator.wikimedia.org/T151338) (owner: 10Papaul) [18:07:59] yurik: I don't see any attached patch? [18:10:15] (03PS1) 10Dzahn: openstack: split nova.pp into one class per file (autoload layout) [puppet] - 10https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) [18:10:57] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/4643/" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [18:12:20] (03CR) 10Dzahn: "+439 -439" [puppet] - 10https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [18:12:52] greg-g, its actually a revert of a portion of kartotherian's module "snapshot", in github. 
Not yet pushed to master. [18:12:58] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on a few images Mediawiki 500s on - https://phabricator.wikimedia.org/T150756#2815163 (10Gilles) Actually, I've seen it 500 in 30ish seconds now, so the 2 minutes thing might have been random. The process gets killed in Mediawiki's case, which would s... [18:13:07] greg-g, i'm planning ahead for the day [18:13:57] yurik: I think that ticket needs more information about the mechanisms involved here. Why is pageprops outdated when the map is initially generated, etc? It seems like throwing an error or reducing cache lifetime are just bandaids around a deeper issue. [18:14:14] (03CR) 10Andrew Bogott: [C: 031] "This is fine with me as long as the puppet-compiler confirms that it's a no-op on labcontrol, labtestcontrol, labvirt, silver, labnet" [puppet] - 10https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [18:14:49] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:15:39] RECOVERY - Restbase root url on restbase2010 is OK: HTTP OK: HTTP/1.1 200 - 15450 bytes in 0.082 second response time [18:16:09] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [18:16:49] RECOVERY - Restbase root url on restbase2011 is OK: HTTP OK: HTTP/1.1 200 - 15450 bytes in 0.101 second response time [18:17:23] !log restbase deploy end of 9c7822d [18:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:39] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [18:21:17] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on a few images Mediawiki 500s on - https://phabricator.wikimedia.org/T150756#2815192 (10Gilles) The GIF: https://upload.wikimedia.org/wikipedia/en/b/ba/Rahansvirtues.gif is interesting because Thumbor takes a long time to render it, while Mediawiki e... 
[18:22:10] (03PS1) 10Jdrewniak: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322942 (https://phabricator.wikimedia.org/T128546) [18:23:15] bblack, it is a bandaid. There is a deeper issue - image generation service needs data from MW API, but if the page was just saved, the database replicas have not yet been updated with the new data, and api fails. I used to throw an error, and that worked "okayish" - editor would get a broken image on occasion, refresh, and it will show up. I made a mistake at "gracefully handling it" - now i generate just the base map, returning a proper [18:23:15] image. But the result is worse - now it gets cached for an hour, and users are very frustrated that the information is stale despite refresh. [18:24:11] we do need a more thorough solution for this, and i would love to discuss how to handle it with gabriel and tim during the dev summit [18:24:20] or anyone else who wants to participate of course [18:25:46] it seems like database replication lag shouldn't be that high, unless there's more to this than just replication lag. that or we need to query this data from a non-replica immediately after save? [18:26:12] or make it synchronous on edit (make the editing user wait on propagation before considering the save complete) [18:26:32] I'm sure there's much more to the problem than I understand [18:26:56] even the 1h cache lifetimes we're using are also not ideal, they should be much longer.
it was an acceptable value a long time ago in testing, but :P [18:27:07] and longer cache lifetimes are only going to make these kinds of issues more apparent [18:27:50] or handle the eventual consistency, having the client retry (at most a few times with a sleep) if missing [18:28:04] yeah I don't even know what "database" we're talking about [18:28:32] if it's cassandra, perhaps the read for the pageprops when saving a new edit needs a higher read replica count or whatever it's called [18:28:34] bblack, agree that it shouldn't depend on the caching timeout. And yes, auto-client retry might work - but first we need to return an error instead of a good image if data is missing - it would be a more honest approach. [18:28:46] regular mysql wiki db [18:28:48] (03PS1) 10Andrew Bogott: Move the role::labs::openstack::nova::wikiupdates to openstack::nova::hooks [puppet] - 10https://gerrit.wikimedia.org/r/322943 [18:29:17] we have something for this in theory - where a header is passed to indicate "needed replication timestamp" [18:29:35] yurik: it's not a db replication lag issue, it's the fact that pageprops get updated in a separate refreshlinks job [18:29:51] ok [18:29:53] mobrovac, yes, that too [18:30:07] i thought i didn't update my code to use the delayed pageprops repl yet [18:30:12] but maybe it's not related [18:30:33] another way to think of that would be to encode version information in the generated map URLs (as in, ?rev=123 or a hash or whatever) so that they're unique every time they're edited [18:30:49] and if a query for a map requests a unique new change that hasn't made it to pageprops yet, return a 404 [18:31:42] bblack, we could, but for that I have to know if the new generated version is different from the one already in pageprops, plus since pageprop changes are not immediate, it still won't help [18:32:00] because i won't know what "repl timestamp" to wait for [18:32:05] (03CR) 10Andrew Bogott: [C: 032] Move the
role::labs::openstack::nova::wikiupdates to openstack::nova::hooks [puppet] - 10https://gerrit.wikimedia.org/r/322943 (owner: 10Andrew Bogott) [18:32:30] so the solution to the immediate problem - throw an error. It does not fully solve it, but removes the 1 hour wait [18:32:42] so a much more "expected" behavior [18:32:51] yurik, when do you query pageprops? [18:33:25] jynus, when the static map image is being generated by kartotherian, on user's request [18:33:38] in other words - right after saving the page [18:33:50] 06Operations, 07Puppet, 07Documentation, 03Google-Code-In-2016, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2054254 (10Florian) @DZahn: I was free and added myself as a mentor to the task, too :) however, do you think that two instances are enough... [18:33:59] well, then it is like the categories- you cannot expect them right away [18:34:16] yurik: if "throw an error means 5xx", that won't be cached at all, it's not a good idea [18:34:17] page save has to be done immediately [18:34:41] bad quoting heh [18:35:01] jynus, (page save -> html result -> browser makes a request for the image -> data is not available) [18:35:03] I would discuss with performance about an architectural solution [18:35:06] you could return a 404 with a cache-control set to 5-10 minutes or something though [18:35:17] (as a bandaid) [18:35:28] maybe it should not be a page property if you need it right after save [18:35:40] and should be somewhere else [18:35:50] jynus, yes, there are plans for a separate db (there is a task for that), but obviously that's a much bigger task [18:35:56] separate table [18:35:56] (I am not a mw guru myself to tell you where) [18:36:07] but I know some that could help you [18:37:41] basically because performance is the people interested in keeping the save time low [18:38:12] 06Operations, 10OCG-General, 10Reading-Community-Engagement, 06Reading-Web-Backlog, and 3 others: [EPIC]
Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2815231 (10JKatzWMF) [18:38:54] jynus, https://phabricator.wikimedia.org/T119043 [18:39:20] there has been a lot of digital ink spilled over that one :) [18:39:42] that has nothing to do with your problem [18:40:02] (03CR) 10Andrew Bogott: [C: 04-1] "puppet compiler says https://puppet-compiler.wmflabs.org/4646/labcontrol1001.wikimedia.org/change.labcontrol1001.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [18:40:25] it is a separate issue [18:40:47] jynus, not directly, but it discusses how tables should be organized to move away from pageprops. And save to that table will need to be in sync because its data is needed right away [18:40:49] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services: Confirm attribution needs - https://phabricator.wikimedia.org/T150875#2815270 (10JKatzWMF) @cscott @faidon Hey folks, I am looking at identifying the actual requirements for attribution of a pdf.... [18:41:08] but yes, a big discussion on that one [18:41:25] bblack, for now i will do a 404 with a short timeout [18:41:43] btw, i still need to rework the caching headers - they are a bit messy with maps [18:44:09] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures.
Failed resources (up to 3 shown): File[/usr/lib/python2.7/dist-packages/wikistatus.egg-info],File[/usr/lib/python2.7/dist-packages/wikistatus] [18:45:29] 06Operations, 10ops-codfw: rack/setup prometheus200[1-2] - https://phabricator.wikimedia.org/T151338#2815301 (10RobH) [18:45:31] 06Operations, 10ops-codfw, 10netops: prometheus200[1-2] switch port configuration - https://phabricator.wikimedia.org/T151357#2815299 (10RobH) 05Open>03Resolved both ports now have proper descriptions set (hostnames), enabled, and set to the internal vlan for each row. [18:50:43] !log trying schema change on db1082 (wikidatawiki.page) T69223 [18:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:53] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [18:52:11] nope [18:53:32] !log trying schema change on db1057 (enwiki.page) T69223 [18:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:04] nope either [18:55:58] * yurik loves production-level experiments :) [18:56:52] well, it is that, or setting servers to read only for half an hour [18:57:04] the second won [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161122T1900). Please do the needful. [19:00:04] dcausse, bd808, and jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:20] one must earn the "i broke wikipedia" badge somehow! I'm not sure you have yours yet :) [19:00:29] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/lib/python2.7/dist-packages/wikistatus.egg-info],File[/usr/lib/python2.7/dist-packages/wikistatus] [19:01:13] o/ [19:01:34] I can SWAT today [19:03:49] bd808: does https://gerrit.wikimedia.org/r/#/c/322929/ mean no more mw1099? [19:04:14] I guess I should swat that one first... [19:05:32] thcipriani: you can use mwdebug1002 [19:05:50] thcipriani: the extensions will redirect requests for mw1099 to mwdebug1002 [19:06:05] Dereckson: ah, thank you! [19:06:50] (03PS2) 10Thcipriani: [cirrus] Increase interwiki loadtest to 75% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322900 (https://phabricator.wikimedia.org/T149740) (owner: 10DCausse) [19:07:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322900 (https://phabricator.wikimedia.org/T149740) (owner: 10DCausse) [19:07:09] RECOVERY - Check systemd state on es2019 is OK: OK - running: The system is fully operational [19:07:14] thcipriani: Dereckson yeah, we should update the wikitech.wm.org swat documentation... [19:07:25] meant to do so, but am distracted with "manager work" right now, ugh :/ [19:07:29] RECOVERY - Check whether ferm is active by checking the default input chain on es2019 is OK: OK ferm input default policy is set [19:07:47] (03Merged) 10jenkins-bot: [cirrus] Increase interwiki loadtest to 75% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322900 (https://phabricator.wikimedia.org/T149740) (owner: 10DCausse) [19:07:50] * thcipriani nods [19:07:57] can do post-swat unless someone beats me there [19:10:46] dcausse: your change is live on mwdebug1002, check please if possible [19:10:53] thcipriani: looking [19:11:19] (03Abandoned) 10Ppchelko: service::node - support sampled logging [puppet] - 10https://gerrit.wikimedia.org/r/302309 (https://phabricator.wikimedia.org/T139674) (owner: 10Ppchelko) [19:13:40] thcipriani: not sure how to test this, but I don't see anything obviously wrong... 
[19:13:58] dcausse: :) ok, going live. [19:17:07] !log thcipriani@tin Synchronized wmf-config/CirrusSearch-production.php: SWAT: [[gerrit:322900|[cirrus] Increase interwiki loadtest to 75%]] (T149740) (duration: 00m 55s) [19:17:13] ^ dcausse live everywhere [19:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:17] T149740: Run load tests of cross-project searching to verify its stability - https://phabricator.wikimedia.org/T149740 [19:17:21] thcipriani: thanks! [19:17:43] jan_drewniak: bd808 ping for SWAT [19:18:10] o/ [19:18:17] hello :) [19:19:43] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322942 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [19:19:50] (03CR) 10Thcipriani: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322942 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [19:19:53] (03PS2) 10Thcipriani: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322942 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [19:20:01] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322942 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [19:20:13] gerrit :(( [19:20:26] never lets me rebase when I know I need to [19:20:53] (03Merged) 10jenkins-bot: Bumping wikipedia.org portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322942 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [19:20:57] !log rebooting es2019 for upgrade [19:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:49] jan_drewniak: your change is live on mwdebug1002, check please [19:23:06] thcipriani: yup! 
looks good [19:23:26] jan_drewniak: ok, running sync-portals now [19:25:09] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table phabricator_conduit.conduit_methodcalllog: Cant find record in conduit_methodcalllog, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1048-bin.001351, end_log_pos 640518090 [19:25:21] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:322942|Bumping wikipedia.org portal to master]] (T128546) (duration: 00m 53s) [19:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:31] T128546: [Recurring Task] Update Wikipedia.org Portal and sister Wiki's statistics - https://phabricator.wikimedia.org/T128546 [19:25:59] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:18] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:322942|Bumping wikipedia.org portal to master]] (T128546) (duration: 00m 56s) [19:26:25] ^ jan_drewniak should be live everywhere [19:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:32] thcipriani: cool thanks! 
[19:27:58] 06Operations, 10ops-codfw, 10netops: prometheus200[1-2] switch port configuration - https://phabricator.wikimedia.org/T151357#2815570 (10fgiunchedi) Retroactively changed descriptions to prometheus200[34] as per parent T151338 [19:28:38] 06Operations, 10ops-codfw: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2814695 (10fgiunchedi) [19:37:29] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 865.83 seconds [19:39:16] (03PS3) 10Ottomata: Configure eventstreams in codfw backed by analytics-eqiad Kafka [puppet] - 10https://gerrit.wikimedia.org/r/322935 (https://phabricator.wikimedia.org/T143925) [19:45:40] (03CR) 10Ottomata: [C: 032] Configure eventstreams in codfw backed by analytics-eqiad Kafka [puppet] - 10https://gerrit.wikimedia.org/r/322935 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [19:48:31] !log set thumbor access for temp containers - T150760 [19:48:34] thcipriani: I missed the ping :( [19:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:41] T150760: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760 [19:48:59] since https://noc.wikimedia.org/conf/debug.json is still the old content I suppose that didn't get merged [19:48:59] bd808: np, I'm around now, if you're around :) [19:49:23] nope, didn't merge yet [19:49:23] thcipriani: mw1099 is already dead as far as I know [19:49:25] !log restarting pybal on lvs2003 and lvs2006 for eventstreams in codfw [19:49:30] it is indeed [19:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:57] that file is not used by anything at all yet. o.ri made it yesterday [19:50:11] so it should be trivially safe to sync out [19:50:25] ok, no idea what the context was, so just left it, will push out now.
[19:50:52] the idea is that we can update the browser extensions to read it from noc to make changing debug hostnames easier [19:50:52] ugh, gerrit not going to let me rebase again. [19:51:09] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [19:51:29] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322929 (owner: 10BryanDavis) [19:51:34] (03PS2) 10Thcipriani: debug.json: update eqiad debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322929 (owner: 10BryanDavis) [19:51:37] (03CR) 10Thcipriani: debug.json: update eqiad debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322929 (owner: 10BryanDavis) [19:51:43] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322929 (owner: 10BryanDavis) [19:52:29] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:52:35] (03Merged) 10jenkins-bot: debug.json: update eqiad debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322929 (owner: 10BryanDavis) [19:53:59] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:54:29] !log thcipriani@tin Synchronized debug.json: SWAT: [[gerrit:322929|debug.json: update eqiad debug hosts]] (duration: 00m 49s) [19:54:35] ^ bd808 live now [19:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:55] thcipriani: looks like I've got to wait for cp4003 to forget about the file before I see the new version [19:56:41] thcipriani: confirmed with https://noc.wikimedia.org/conf/debug.json?foo to break cache [19:56:48] thanks [19:56:54] yw :) [20:00:33] (03PS1) 10Ottomata: Add eventstreams.wikimedia.org to cache misc [puppet] - 10https://gerrit.wikimedia.org/r/322954 (https://phabricator.wikimedia.org/T143925) [20:03:06] (03CR) 1020after4: "conf/local/ENVIRONMENT doesn't work 
because we need to override it per-process not per-server." [puppet] - 10https://gerrit.wikimedia.org/r/322781 (https://phabricator.wikimedia.org/T151148) (owner: 1020after4) [20:04:51] (03PS2) 10Ottomata: Add eventstreams.wikimedia.org to cache misc [puppet] - 10https://gerrit.wikimedia.org/r/322954 (https://phabricator.wikimedia.org/T143925) [20:12:46] RECOVERY - NTP on es2019 is OK: NTP OK: Offset -0.0005176663399 secs [20:20:07] (03CR) 10Ottomata: "Faidon thinks we should reuse stream.wikimedia.org for this, and I'd prefer if we could too. RCStream lives at 'http://stream.wikimedia.o" [puppet] - 10https://gerrit.wikimedia.org/r/322954 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [20:34:46] (03PS1) 10Eevans: enable instance restbase2011-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/322958 (https://phabricator.wikimedia.org/T151086) [20:35:25] (03CR) 10Eevans: [C: 04-1] "Not just yet (but soon)." [puppet] - 10https://gerrit.wikimedia.org/r/322958 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [20:35:57] (03CR) 10Filippo Giunchedi: "> I'm not familiar with that puppet compiler tool, does it mean that" [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [20:47:39] ema, around? 
[20:50:21] (03CR) 10Faidon Liambotis: "To elaborate:" [puppet] - 10https://gerrit.wikimedia.org/r/322954 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [20:55:17] (03PS2) 10Filippo Giunchedi: Don't send client caching headers for successful thumbnails in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/322280 (https://phabricator.wikimedia.org/T150642) (owner: 10Gilles) [20:56:45] (03CR) 10Filippo Giunchedi: [C: 032] Don't send client caching headers for successful thumbnails in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/322280 (https://phabricator.wikimedia.org/T150642) (owner: 10Gilles) [20:57:40] (03PS1) 10Ottomata: Allow misc directors to specify url path conditions as well as Host conditions [puppet] - 10https://gerrit.wikimedia.org/r/322964 [20:58:22] 06Operations, 10MediaWiki-Maintenance-scripts, 06Performance-Team, 10Thumbor: ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#2816014 (10fgiunchedi) @Gilles it is in WikimediaMaintenance Also related, the temp containers should have... [20:58:48] (03CR) 10Ottomata: "Totally untested PoC of a way to handle the cache misc url director selection: https://gerrit.wikimedia.org/r/#/c/322964/" [puppet] - 10https://gerrit.wikimedia.org/r/322954 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [20:59:20] (03CR) 10Ottomata: [C: 04-1] "This is a total PoC to see if we can handle url path based routing in cache misc director config for https://gerrit.wikimedia.org/r/#/c/3" [puppet] - 10https://gerrit.wikimedia.org/r/322964 (owner: 10Ottomata) [21:00:27] 06Operations, 10MediaWiki-Database, 07Performance: Use mysqli both in Zend and HHVM - https://phabricator.wikimedia.org/T149742#2761275 (10Paladox) @MaxSem apparently php5-mysqli is not a package, instead it is bundled in php5-mysql, we just need to add the extension to php.ini That is at least what happend... 
[21:01:46] PROBLEM - puppet last run on mc1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:03:27] 06Operations, 10MediaWiki-Database, 07Performance: Use mysqli both in Zend and HHVM - https://phabricator.wikimedia.org/T149742#2816027 (10MaxSem) 05Open>03Invalid D'oh, thanks Paladox! ``` maxsem@tin:~$ php5 -a Interactive mode enabled php > var_dump(function_exists('mysqli_connect')); bool(true) php... [21:04:03] 06Operations, 10MediaWiki-Database, 07Performance: Use mysqli both in Zend and HHVM - https://phabricator.wikimedia.org/T149742#2816031 (10Paladox) your welcome :) [21:04:46] (03CR) 10Filippo Giunchedi: [C: 032] Remove role::beta::trebuchet_testing [puppet] - 10https://gerrit.wikimedia.org/r/322405 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [21:05:01] (03PS3) 10Filippo Giunchedi: Remove role::beta::bastion [puppet] - 10https://gerrit.wikimedia.org/r/322404 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [21:06:19] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760#2816034 (10fgiunchedi) The perms should be fixed everywhere now @Gilles [21:06:30] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760#2816035 (10fgiunchedi) a:05fgiunchedi>03Gilles [21:14:24] (03PS1) 10MaxSem: Add discovery reports [puppet] - 10https://gerrit.wikimedia.org/r/322969 (https://phabricator.wikimedia.org/T147034) [21:14:40] (03CR) 10Filippo Giunchedi: [C: 032] Remove role::beta::bastion [puppet] - 10https://gerrit.wikimedia.org/r/322404 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [21:14:52] (03PS3) 10Filippo Giunchedi: Remove role::beta::trebuchet_testing [puppet] - 10https://gerrit.wikimedia.org/r/322405 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) 
[21:14:59] (03CR) 10Filippo Giunchedi: [V: 032] Remove role::beta::trebuchet_testing [puppet] - 10https://gerrit.wikimedia.org/r/322405 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [21:17:48] (03PS1) 10Filippo Giunchedi: codfw: rename prometheus200[12] to prometheus200[34] [dns] - 10https://gerrit.wikimedia.org/r/322970 (https://phabricator.wikimedia.org/T151338) [21:18:48] (03CR) 10Filippo Giunchedi: [C: 032] codfw: rename prometheus200[12] to prometheus200[34] [dns] - 10https://gerrit.wikimedia.org/r/322970 (https://phabricator.wikimedia.org/T151338) (owner: 10Filippo Giunchedi) [21:22:13] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2816117 (10fgiunchedi) There was a mistake in host naming at the beginning (prometheus200[12] already exist as VMs), I've moved prometheus2001 to prometheus2003 and prometheus2002 to prome... [21:26:32] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2816123 (10GWicke) We do have some information about format support from Accept & User-Agent headers in regular image requests. Chrome for example send... [21:28:46] RECOVERY - puppet last run on mc1024 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:31:48] (03PS1) 1020after4: Phabricator: conf_env resources need phabricator package installed [puppet] - 10https://gerrit.wikimedia.org/r/322972 [21:32:58] (03CR) 10Paladox: [C: 031] Phabricator: conf_env resources need phabricator package installed [puppet] - 10https://gerrit.wikimedia.org/r/322972 (owner: 1020after4) [21:33:06] (03CR) 10Thcipriani: [C: 031] "removing -1 since, as I missed in dependent changes, role::beta::trebuchet_testing is gone now." 
[puppet] - 10https://gerrit.wikimedia.org/r/322406 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [21:43:51] 06Operations, 07Puppet, 07Documentation, 03Google-Code-In-2016, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2816208 (10Dzahn) @Florian Oh, very nice. That's cool to have another mentor on it. You are probably right about too much work for an insta... [22:01:06] (03CR) 10Dzahn: [C: 04-1] "oh, weird that it fails like that. thanks for compiling! .. looking" [puppet] - 10https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [22:02:33] (03PS2) 10Dzahn: openstack: split nova.pp into one class per file (autoload layout) [puppet] - 10https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) [22:02:47] (03CR) 10jenkins-bot: [V: 04-1] openstack: split nova.pp into one class per file (autoload layout) [puppet] - 10https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [22:04:59] (03CR) 10Dzahn: ".. 
have to solve rebase conflict" [puppet] - 10https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [22:09:15] (03Draft1) 10Paladox: Make setting ipv6 optional in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/322980 [22:09:18] (03Draft2) 10Paladox: Make setting ipv6 optional in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/322980 [22:09:53] (03PS3) 10Paladox: Make setting ipv6 optional in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) [22:14:18] (03CR) 10Chad: [C: 04-1] Make setting ipv6 optional in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:15:24] (03CR) 10Paladox: "@Chad" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:18:39] (03PS4) 10Paladox: Make setting ipv6 optional in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) [22:20:00] (03PS5) 10Paladox: Make setting ipv6 optional in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) [22:26:56] (03CR) 10Chad: Make setting ipv6 optional in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:27:39] (03PS3) 10Dzahn: openstack: split nova.pp into one class per file (autoload layout) [puppet] - 10https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) [22:28:15] (03CR) 10Paladox: Make setting ipv6 optional in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:28:20] (03CR) 1020after4: [C: 04-1] "@paladox: I'll fix it, one moment..." 
[puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:28:42] 06Operations, 06Analytics-Kanban, 10hardware-requests: stat1001 replacement box in eqiad - https://phabricator.wikimedia.org/T149911#2816480 (10RobH) I should note that the spare pool system WMF4726 was purchased in December of 2015. It is 1/3rd of the way through its 3 year warranty. [22:29:45] (03CR) 10Paladox: "Oh ok thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:32:28] (03CR) 10Dzahn: "can you set a (fake) IPv6 address (from a range for testing maybe) in labs hiera to avoid the $realm check? what happens if it just finds " [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:34:44] (03PS1) 10Filippo Giunchedi: install_server: add prometheus partman [puppet] - 10https://gerrit.wikimedia.org/r/323056 (https://phabricator.wikimedia.org/T151338) [22:36:00] (03PS6) 1020after4: Phabricator: define vcs interfaces only when configured in hiera [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:36:42] jouncebot: next [22:36:43] In 1 hour(s) and 23 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161123T0000) [22:37:17] (03CR) 1020after4: "dzahn: we could fake the address but in labs we don't even need the vcs address at all..." 
[puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:38:05] (03CR) 10Dzahn: "somewhere i wanted a "fake" IPv6 address before and i used one from "2001:db8:: –" [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:40:43] (03CR) 10Dzahn: "dunno, just seems the easiest to me to set some value in labs hiera, maybe that is enough and you dont even need to touch puppet" [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:40:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Move hhvm_exporter to its own package [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/322371 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [22:41:05] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Debian packaging [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/322372 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [22:41:33] (03CR) 1020after4: [C: 031] "http://puppet-compiler.wmflabs.org/4651/" [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:42:35] (03CR) 10Paladox: "Using this fake address fd3c:d735:a767:a3d9:ffff:ffff:ffff:ffff worked." [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:42:44] (03CR) 1020after4: "if that works, I'm ok with it..." 
[puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:43:36] (03CR) 10Paladox: "I now get these errors" [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:43:42] (03PS2) 10Reedy: Deploy EmailAuth to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322792 (https://phabricator.wikimedia.org/T151015) (owner: 10Gergő Tisza) [22:43:46] (03CR) 10Reedy: [C: 032] Deploy EmailAuth to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322792 (https://phabricator.wikimedia.org/T151015) (owner: 10Gergő Tisza) [22:44:24] (03CR) 10Chad: [C: 031] "PS6 is what I was thinking of, considering this isn't default as undef anywhere else." [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:44:41] (03Merged) 10jenkins-bot: Deploy EmailAuth to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322792 (https://phabricator.wikimedia.org/T151015) (owner: 10Gergő Tisza) [22:44:54] (03CR) 10Paladox: [C: 031] Phabricator: define vcs interfaces only when configured in hiera [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:46:16] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:53:53] (03CR) 10Dzahn: "not trying to add the IP when none is set seems good, but that will also not fix that the service will fail to start, like in the errors P" [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:54:45] (03CR) 10Dzahn: [C: 031] Phabricator: define vcs interfaces only when configured in hiera [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:55:19] (03CR) 10Dzahn: "it's not a bad change, but it will not get us past the errors Paladox pasted, i think" [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:55:43] (03CR) 10Chad: "No, I'm pretty sure that failure is unrelated." [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:59:03] (03CR) 10Dzahn: [C: 032] Phabricator: define vcs interfaces only when configured in hiera [puppet] - 10https://gerrit.wikimedia.org/r/322980 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [22:59:17] !log reedy@tin Synchronized wmf-config/extension-list-labs: EmailAuth to beta T151015 (duration: 00m 55s) [22:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:29] T151015: Deploy EmailAuth extension to the beta cluster - https://phabricator.wikimedia.org/T151015 [23:00:06] PROBLEM - DPKG on mwdebug1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:00:33] !log reedy@tin Synchronized wmf-config/CommonSettings-labs.php: EmailAuth to beta T151015 (duration: 00m 57s) [23:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:06] merged, nothing happened on iridium [23:01:29] paladox: ^ but separately you will still need to set an actual IP if you want the service to start [23:01:39] Oh [23:01:51] maybe it's just 
v4 and you skip v6 in labs [23:02:02] but at least it would need a separate v4 [23:02:13] if you actually want to have that service running [23:02:18] with the current role [23:02:19] Yep :) [23:02:22] !log reedy@tin Synchronized wmf-config/InitialiseSettings-labs.php: EmailAuth to beta T151015 (duration: 00m 51s) [23:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:52] so maybe it's actually to request that IP for the labs project and add it in horizon [23:02:59] (03PS1) 10Filippo Giunchedi: debian: don't require default file [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/323071 [23:03:02] and then set a working one in hiera [23:03:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] debian: don't require default file [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/323071 (owner: 10Filippo Giunchedi) [23:03:51] tgr: It's vaguely live on beta and consistent on prod config [23:03:54] or if you only want to test other things and not the ssh service then needs more changes to skip more stuff [23:04:53] oh [23:04:55] the nicest of all options would be figure out how to make v6 work in labs instance [23:06:16] Yeh [23:06:41] ip4 addresses are now at the limit [23:07:06] mutante, paladox: Anyway, the blame for the group 'vcs' failure lies in a7c50b2. That create_resource() call in init.pp relies on the user/group in vcs.pp having been done already. Two problems: A) This dependency isn't declared anywhere, hence the failure, and B) vcs.pp never has a group{} define, just a user{}, which is wrong. [23:07:31] The IPv4/6 issue is ugly, but understandable, and twentyafterfour's patch we just landed will fix that. [23:07:37] oh and yes [23:07:41] thanks [23:07:56] Tested twentyafterfour fix after mutante merged it and it works [23:08:07] ive set the ipv4 but removed ipv6 :) [23:08:22] It's been on my todo list to untangle this spaghetti code in phab, I just haven't had the cycles. 
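Note on the 'vcs' group failure described at 23:07:06: a minimal Puppet sketch of the two fixes named there, declaring the missing group and making the ordering explicit. The class name `phab_sketch`, the directory `/srv/phab-sketch-repos` and all attributes are assumptions for illustration; none of this is copied from the real vcs.pp or init.pp.

```puppet
# Hypothetical sketch, not the actual phabricator module.
class phab_sketch::vcs {
    # Problem B from the log: only a user{} existed, never a group{}.
    group { 'vcs':
        ensure => present,
        system => true,
    }

    user { 'vcs':
        ensure  => present,
        gid     => 'vcs',
        system  => true,
        require => Group['vcs'],
    }
}

class phab_sketch {
    include ::phab_sketch::vcs

    # Problem A from the log: the resource using the user/group must
    # declare that it depends on the class that creates them.
    file { '/srv/phab-sketch-repos':
        ensure  => directory,
        owner   => 'vcs',
        group   => 'vcs',
        require => Class['phab_sketch::vcs'],
    }
}
```

With the `require` in place, Puppet applies the group and user before the directory, whatever order the classes are evaluated in.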
[23:09:04] :) [23:09:37] I guess we will need to add the dependency, but not sure what needs adding and where. [23:10:14] The whole thing needs refactoring tbh [23:10:32] yeh, i guess we could use phabricator instance for all the testing [23:10:54] ostriches we could create a separate testing class for the refactor then rename it all when we are done [23:11:01] to prevent prod from failing :) [23:11:19] ostriches: indeed it needs a major refactoring [23:11:30] We didn't need that for gerrit. It's about making small incremental no-op changes that slowly move things to the right place :) [23:11:38] Oh [23:11:41] ostriches: a lot has changed and the puppet code hasn't kept up because it's difficult for me to get code review on puppet code [23:12:14] twentyafterfour: Bigger problem is the structure is kinda flawed from the beginning, which means any changes are ugly. Not your fault :) [23:12:37] ostriches: I know, I think paladox's idea is better (new class that eventually replaces the current one) [23:12:44] what he said about small incremental changes and all that :) sounds good [23:12:52] :) [23:12:59] because it's impossible to test both labs and prod without breaking one or the other [23:13:15] mutante: yes small incremental changes will take a year or two to get to where we need to be [23:13:21] Once we get the prod one working on labs (may be fixed really really soon) we can remove labs [23:13:30] twentyafterfour: Or like a month, when I finally get to it :) [23:13:32] remove labs phabricator class and move it all to prod [23:13:32] and I'll probably quit in frustration before it's done [23:13:38] Step 0 tho: go through and remove anything that's outright dumb/unused [23:14:01] ostriches: it's almost all dumb and almost all of it is used [23:14:05] not much unused in there [23:14:16] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [23:14:45] the way out of "impossible to test both labs and
prod" is what we are doing now though, actually run the prod class in labs, see what issues need to be addressed, get past them, until the class can actually be applied to an instance [23:14:49] twentyafterfour ostriches i found this in the error log [23:14:50] Nov 22 23:11:48 phabricator phd[2791]: [2016-11-22 23:11:48] EXCEPTION: (AphrontConnectionQueryException) Attempt to connect to app_user@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (4). at [/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:343] [23:14:50] Nov 22 23:11:48 phabricator phd[2791]: arcanist(), phabricator(), phutil(), security(), sprint(), wmf-ext-misc() [23:15:04] Crappy config/defaults. [23:15:08] That sucks. [23:15:10] Oh [23:15:12] i'd be careful about trying to fix it by creating yet another class [23:15:29] paladox: that is because labs doesn't have a mysql server for phabricator [23:15:40] Oh [23:15:42] Crappy defaults. [23:15:44] Like I said [23:15:45] we need to either spin up a mysql cluster in the phab project or run it locally [23:15:52] Default db names: always localhost. [23:15:55] I guess run it locally [23:15:55] Always always always [23:15:56] ostriches: not really, the defaults for labs were in the separate labs role [23:16:10] ostriches: localhost doesn't have a db [23:16:16] separate labs roles are evil. [23:16:17] apt-get install mariadb-server-10.1 [23:16:20] so it's not just the defaults [23:16:22] because in theory there are centralized db servers for testing, but not really.. right [23:16:31] apt-get install mariadb-server-10.0 [23:16:35] ostriches: value judgements don't help, this is where we are and we are trying to fix it [23:16:45] I wonder what the pass will be [23:16:46] LOL [23:17:33] twentyafterfour ostriches is there a way i can change the db to use localhost in the config file?
[23:17:43] the labs setup is so different from prod that it made sense to have separate roles when it was originally built [23:17:46] since doing it manually, puppet will erase it. [23:18:02] thanks Reedy! except I forgot WMF wikis have real name disabled so there is no way to test it :( [23:18:13] bah :/ [23:18:14] paladox: need to move mysql.host to hiera, it's currently hard-coded somewhere [23:18:15] lol [23:18:21] Oh [23:18:35] but this is a cycle. "labs is so different from prod" -> "make labs role" -> "labs is more different from prod" [23:18:40] twentyafterfour should i do that, though I don't fully know how to convert it to hiera [23:18:51] or would you like to do it please? [23:18:52] what I want to know is why is this stuff divided between the role and the phabricator module... most of what's in the role is a mess [23:19:13] mutante: I agree [23:19:29] mutante: that's why we are trying to make it work with the prod role [23:19:43] we are almost there [23:19:47] yes, totally, we all agree on that [23:19:53] to getting it working with prod role [23:20:00] i encouraged paladox to do just that, yea [23:20:46] I'm not clear about what things explicitly belong in the role and what belongs in phabricator module. Do we have any guidelines for what goes where? [23:21:05] :) [23:21:08] because right now it's a messy and seemingly arbitrary split between phabricator::* and role::phabricator::main [23:21:14] (03PS1) 10Gergő Tisza: Fix EmailAuth beta cluster enabling hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323078 (https://phabricator.wikimedia.org/T151015) [23:21:19] Reedy: ^ [23:21:25] paladox: I'll try to figure out where it's hardcoded [23:21:30] Thanks [23:21:52] mutante: ostriches ^^^ do you guys have any pointers for me about what parts belong in the role?
[23:22:18] like I'd probably put the interface::ip stuff in the phabricator::vcs class along with the parts that use it [23:22:21] twentyafterfour https://github.com/wikimedia/operations-puppet/blob/e959321aa620b77403cc9379db2e86080323c6e8/modules/phabricator/manifests/init.pp#L79 [23:22:23] twentyafterfour: Roles should include minimal (if any) config, and only include module(s) [23:22:32] twentyafterfour: setting variables should be in the role, and it should use the module, while there are no hardcoded config values in the module itself [23:22:46] ok so this is totally wrong then [23:22:54] that's what I thought [23:22:55] Yes, it's completely and totally wrong :) [23:23:01] I'm working on a first patch [23:23:05] well that makes it a lot easier to fix [23:23:35] ostriches: ok so I should just forget it for now? we shouldn't both be working on the same file we will get a bunch of conflicts [23:23:45] :) [23:24:17] twentyafterfour https://github.com/wikimedia/operations-puppet/blob/0983592c8a8d1a50f89d9a323a6b5e6b4fd384d7/modules/role/manifests/phabricator/main.pp#L46 [23:24:27] https://github.com/wikimedia/operations-puppet/blob/0983592c8a8d1a50f89d9a323a6b5e6b4fd384d7/modules/role/manifests/phabricator/main.pp#L24 [23:24:36] oh, right [23:24:39] ^^ there's a lot of mysql hard code in there [23:24:58] we are going to change that password at some point [23:25:00] paladox: yeah I'm trying to move that to hiera [23:25:07] but now it's easy to change it [23:25:08] Thanks [23:27:21] would you like it if we split the firewall rules from "main" into phabricator::firewall or something?
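Note on the role/module split described at 23:22:23 and 23:22:32, and the plan at 23:25:00 to move the MySQL host to hiera: a minimal Puppet sketch under stated assumptions. The names `phab_sketch`, `role::phab_sketch`, the hiera key `phab_sketch::mysql_host` and the file path are hypothetical. `hiera()` with a default as its second argument is the Puppet 3/4-era lookup function; it is not necessarily what the real patch uses.

```puppet
# Hypothetical module class: all configuration arrives as parameters,
# so no hostname is hard-coded inside the module.
class phab_sketch (
    $mysql_host,
    $mysql_port = 3306,
) {
    file { '/etc/phab-sketch.conf':
        ensure  => file,
        mode    => '0440',
        content => "mysql.host=${mysql_host}\nmysql.port=${mysql_port}\n",
    }
}

# Hypothetical role class: looks the value up in hiera and hands it to
# the module. The 'localhost' default is an assumption, echoing the
# "default db names: always localhost" remark in the log.
class role::phab_sketch {
    $mysql_host = hiera('phab_sketch::mysql_host', 'localhost')

    class { '::phab_sketch':
        mysql_host => $mysql_host,
    }
}
```

A labs project could then override `phab_sketch::mysql_host` in its own hiera data without a separate labs role.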
[23:28:09] well, i have ideas for changes too but i'll wait a bit to not create just even more rebase conflicts [23:28:14] twentyafterfour: Eh, go ahead, it's a rabbit hole and I should be working on 1.28 [23:28:21] Sorry for half-licking your cookie [23:28:26] * ostriches sends it back mildly soggy [23:28:31] ewww, hehe [23:31:14] (03PS1) 10Filippo Giunchedi: Add hhvm_exporter role and class [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) [23:31:23] (03PS1) 10Chad: Phab: Remove pointless variable assignment [puppet] - 10https://gerrit.wikimedia.org/r/323081 [23:32:27] ^ My one tiny contribution [23:32:50] (03CR) 10Paladox: "This seemed to fix this. @Dzahn could you merge please?" [puppet] - 10https://gerrit.wikimedia.org/r/322972 (owner: 1020after4) [23:33:12] (03CR) 10Paladox: [C: 031] Phab: Remove pointless variable assignment [puppet] - 10https://gerrit.wikimedia.org/r/323081 (owner: 10Chad) [23:33:17] (03PS2) 10Filippo Giunchedi: Add hhvm_exporter role and class [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) [23:34:36] (03PS1) 1020after4: phabricator: Move mysql hostnames to hiera [puppet] - 10https://gerrit.wikimedia.org/r/323082 [23:36:04] (03CR) 1020after4: [C: 031] Phab: Remove pointless variable assignment [puppet] - 10https://gerrit.wikimedia.org/r/323081 (owner: 10Chad) [23:37:13] https://gerrit.wikimedia.org/r/#/c/323082/ moves the mysql_host to hiera [23:38:02] (03CR) 10Paladox: [C: 031] "We will also have to do mysql.user and mysql.password too please :)" [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [23:38:17] twentyafterfour thanks, i left a note ^^ [23:39:10] (03CR) 10Paladox: phabricator: Move mysql hostnames to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [23:39:26] RECOVERY - cassandra-b CQL 10.192.32.153:9042 on restbase2011 is OK: TCP OK - 0.036 second response time on 
10.192.32.153 port 9042 [23:39:52] (03CR) 10Dzahn: "@AndrewBogott fixed http://puppet-compiler.wmflabs.org/4652/" [puppet] - 10https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [23:40:06] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:40:27] (03CR) 10jenkins-bot: [V: 04-1] Phab: Remove pointless variable assignment [puppet] - 10https://gerrit.wikimedia.org/r/323081 (owner: 10Chad) [23:41:03] (03PS2) 10Dzahn: Phabricator: conf_env resources need phabricator package installed [puppet] - 10https://gerrit.wikimedia.org/r/322972 (owner: 1020after4) [23:41:23] (03CR) 10jenkins-bot: [V: 04-1] Add hhvm_exporter role and class [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [23:41:49] (03CR) 1020after4: phabricator: Move mysql hostnames to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [23:42:51] (03CR) 10Chad: "Well, this is *better* insofar as it allows labs to override it easier and removes the explicit config in the role." [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [23:43:43] (03CR) 10Chad: phabricator: Move mysql hostnames to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [23:43:57] (03CR) 10Paladox: phabricator: Move mysql hostnames to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [23:44:41] (03PS3) 10Filippo Giunchedi: Add hhvm_exporter role and class [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) [23:44:43] (03CR) 1020after4: "chad: That's great but we are working with small incremental changes, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [23:45:19] (03CR) 1020after4: phabricator: Move mysql hostnames to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [23:46:33] (03CR) 10Chad: "Indeed. I'm mainly curious if cleaning up the init file is the best first step so we can sort this in the right direction going forward :)" [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [23:46:34] ostriches: I really just want to trash role::phabricator::main and rewrite it but I keep hearing "small incremental changes'... [23:46:50] Hey, as long as the on-disk resources end up the same, go for it :D [23:47:13] ostriches: by 'cleaning up the init file' you mean phabricator/manifests/init.pp ? [23:47:17] Yep. [23:47:25] That is a bit of a mess but the role is way worse [23:47:44] I guess my point was: a clean init.pp allows you to move shit outta the role to where it belongs [23:47:52] But I'm going to stop backseat driving now, I promise. 
[23:47:53] I want to move most of the stuff from the role down into the various module classes, then refactor the module a bit more after that [23:48:11] That works [23:48:13] * ostriches shuts up [23:48:42] I'm sure there are some circular dependencies in here that we will have to untangle but the role is the part that bothers me most right now [23:48:49] (03CR) 10Dzahn: "good question about m3-slave in codfw, i looked at dbtree.wikimedia.org and it says that m3 consists of: db1043, db1048 and db2012, so th" [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [23:51:02] twentyafterfour i guess the time has come for the big phabricator cleanup, LOL [23:51:16] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2816664 (10AndyRussG) Related: T151418 and T151419 [23:51:20] Let's create a phabricator-test puppet role (for the big refactor) [23:52:20] (03CR) 10Dzahn: "it looks like all database master/slave names like that are in eqiad, even when they point to something in codfw.. eh, compare this:" [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [23:52:59] host m5-slave.eqiad.wmnet .. is an alias for db2030.codfw [23:53:23] (03PS2) 10Gergő Tisza: Fix EmailAuth beta cluster enabling hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323078 (https://phabricator.wikimedia.org/T151015) [23:53:24] it looks like all those master/slave names are in .eqiad. even when they point to codfw [23:55:47] yeah that's odd [23:55:56] host m3-slave.eqiad.wmnet is not an alias for codfw though [23:56:07] bblack: hey! got a few minutes this afternoon for some quick Varnish thoughts?
[23:56:36] mutante: it's fine for now, phab isn't configured to query the slaves anyway (only the dump script uses a slave I believe) [23:56:45] especially for T151419 T149873 [23:56:45] T151419: Spike: CentralNotice: Is a Varnish banner/campaign quick flush switch feasible? - https://phabricator.wikimedia.org/T151419 [23:56:45] T149873: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873 [23:56:52] twentyafterfour: ok [23:57:35] (03CR) 10Dzahn: [C: 032] Phabricator: conf_env resources need phabricator package installed [puppet] - 10https://gerrit.wikimedia.org/r/322972 (owner: 1020after4) [23:57:44] ema: ^ ? [23:59:40] (03PS2) 10Dzahn: Phab: Remove pointless variable assignment [puppet] - 10https://gerrit.wikimedia.org/r/323081 (owner: 10Chad)