[00:45:03] (Abandoned) Aude: Update my ssh key [puppet] - https://gerrit.wikimedia.org/r/363180 (owner: Aude)
[00:45:34] Operations, Commons, Thumbor, Traffic, media-storage: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3442366 (Jeff_G) New symptom: https://upload.wikimedia.org/wikipedia/commons/thumb/2/23/Ortega%2C_Juan_de_%E2%80%93_Tratado_subtilissimo_de_aritm...
[02:31:54] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.7) (duration: 09m 13s)
[02:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:35] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.9) (duration: 12m 48s)
[03:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:03] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2952.00 Read Requests/Sec=2329.10 Write Requests/Sec=45.80 KBytes Read/Sec=40752.40 KBytes_Written/Sec=466.00
[03:12:50] would this be the place to report an error with phab?
[03:13:52] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jul 17 03:13:51 UTC 2017 (duration 7m 16s)
[03:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:23] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=188.60 Read Requests/Sec=216.30 Write Requests/Sec=4.90 KBytes Read/Sec=3807.20 KBytes_Written/Sec=537.20
[03:28:03] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 773.10 seconds
[04:02:04] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 260.48 seconds
[04:15:04] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=8238.00 Read Requests/Sec=3095.50 Write Requests/Sec=1295.00 KBytes Read/Sec=40918.00 KBytes_Written/Sec=11767.60
[04:18:14] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=9.90 Read Requests/Sec=0.10 Write Requests/Sec=15.60 KBytes Read/Sec=0.40 KBytes_Written/Sec=126.40
[05:09:29] !log Restart MySQL on labsdb1009 for maintenance - T170657
[05:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:41] T170657: labsdb1009 crashed while doing an alter table on templatelinks - https://phabricator.wikimedia.org/T170657
[05:14:33] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[05:16:06] ^ expected
[05:16:33] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0
[05:21:40] !log Add 50G to /srv on db1069
[05:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:28] (PS1) Giuseppe Lavagetto: [WiP] Re-write etcd driver [debs/pybal] - https://gerrit.wikimedia.org/r/365548
[05:52:34] (CR) jerkins-bot: [V: -1] [WiP] Re-write etcd driver [debs/pybal] - https://gerrit.wikimedia.org/r/365548 (owner: Giuseppe Lavagetto)
[05:53:31] <_joe_> meh
[05:54:30]
(PS1) Giuseppe Lavagetto: [WiP] Re-write etcd driver [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/365549
[05:54:52] (Abandoned) Giuseppe Lavagetto: [WiP] Re-write etcd driver [debs/pybal] - https://gerrit.wikimedia.org/r/365548 (owner: Giuseppe Lavagetto)
[05:56:06] (CR) jerkins-bot: [V: -1] [WiP] Re-write etcd driver [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/365549 (owner: Giuseppe Lavagetto)
[06:32:53] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikishared.echo_unread_wikis: Can't find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1031-bin.002179, end_log_pos 377549678
[06:33:03] <_joe_> uh
[06:33:09] <_joe_> marostegui: ^^
[06:34:51] _joe_: I will take a look, thanks :)
[06:36:29] (CR) Giuseppe Lavagetto: Generalize state management, allow multiple run modes (2 comments) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/363217 (owner: Giuseppe Lavagetto)
[06:37:06] (PS8) Giuseppe Lavagetto: Generalize state management, allow multiple run modes [software/puppet-compiler] - https://gerrit.wikimedia.org/r/363217
[06:37:08] (PS6) Giuseppe Lavagetto: Add coverage report [software/puppet-compiler] - https://gerrit.wikimedia.org/r/363350
[06:37:10] (PS6) Giuseppe Lavagetto: Raise test coverage percentage [software/puppet-compiler] - https://gerrit.wikimedia.org/r/363351
[06:37:12] (PS9) Giuseppe Lavagetto: Add future parser run mode [software/puppet-compiler] - https://gerrit.wikimedia.org/r/363808 (https://phabricator.wikimedia.org/T169546)
[06:37:27] (CR) Giuseppe Lavagetto: [C: 2] Generalize state management, allow multiple run modes [software/puppet-compiler] - https://gerrit.wikimedia.org/r/363217 (owner: Giuseppe Lavagetto)
[06:38:13] (Merged) jenkins-bot: Generalize
state management, allow multiple run modes [software/puppet-compiler] - https://gerrit.wikimedia.org/r/363217 (owner: Giuseppe Lavagetto)
[06:41:56] (PS1) Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/365551 (https://phabricator.wikimedia.org/T166204)
[06:43:27] (CR) Brian Wolff: [C: 1] "[Note: legal may still need to approve this before it can be merged]" [puppet] - https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) (owner: Umherirrender)
[06:43:54] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/365551 (https://phabricator.wikimedia.org/T166204) (owner: Marostegui)
[06:45:46] (Merged) jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/365551 (https://phabricator.wikimedia.org/T166204) (owner: Marostegui)
[06:46:08] (CR) jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/365551 (https://phabricator.wikimedia.org/T166204) (owner: Marostegui)
[06:47:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1073 - T166204 (duration: 01m 04s)
[06:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:20] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204
[06:47:57] Operations, netops, Patch-For-Review: deploy diffscan2 - https://phabricator.wikimedia.org/T169624#3442654 (MoritzMuehlenhoff) Two things I noticed (or we can move these to a new task): - For runs where no changes appeared, there should be no output/mail - Maybe we should set up a separate cron job t...
[06:48:56] !log Deploy alter table on s1 - db1073 - T166204
[06:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:32] !log Stop replication on db1095 for maintenance - T153743
[06:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:43] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[06:50:44] (CR) Alexandros Kosiaris: [C: 2] smokeping: restrict http access to prod networks [puppet] - https://gerrit.wikimedia.org/r/365434 (owner: Dzahn)
[06:55:13] (PS1) Marostegui: db-eqiad.php: Depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/365554 (https://phabricator.wikimedia.org/T168661)
[07:00:16] !log Rename labsdb1011 main replication thread to a specific one - T153743
[07:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:26] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[07:03:00] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/365554 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:04:09] (Merged) jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/365554 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:05:15] !log marostegui@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details)
[07:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:20] (CR) jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - https://gerrit.wikimedia.org/r/365554 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:06:31] !log
marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1056 - T168661 (duration: 00m 46s)
[07:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:41] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[07:07:30] (CR) Marostegui: [C: 2] s6.hosts: Add labsdb1011 [software] - https://gerrit.wikimedia.org/r/365233 (https://phabricator.wikimedia.org/T153743) (owner: Marostegui)
[07:08:41] (Merged) jenkins-bot: s6.hosts: Add labsdb1011 [software] - https://gerrit.wikimedia.org/r/365233 (https://phabricator.wikimedia.org/T153743) (owner: Marostegui)
[07:09:03] !log Deploy alter table s4 - db1056 - T168661
[07:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:39] !log Stop slave s2 on db1102 for maintenance - T153743
[07:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:49] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[07:14:13] (CR) Muehlenhoff: "Removing base:firewall from stat1005 is fine, until we sort out a technical solution to the underlying problem with Spark. We can abandon " [puppet] - https://gerrit.wikimedia.org/r/365120 (owner: Ottomata)
[07:18:53] (CR) Alexandros Kosiaris: [C: -1] "/me likes this. Some inline comments. I am also adding moritzm, and _joe_ as reviewers" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/365518 (owner: Lucas Werkmeister (WMDE))
[07:24:55] (CR) Alexandros Kosiaris: [C: -2] "> 1. Why is there an ensure parameter on this class at all? Perhaps it should be removed."
[puppet] - https://gerrit.wikimedia.org/r/365120 (owner: Ottomata)
[07:28:53] (CR) Muehlenhoff: "There used to be a separate tool called exiscan, but that was merged into exim quite a while ago, so I suppose Exim should have all that b" [puppet] - https://gerrit.wikimedia.org/r/364827 (https://phabricator.wikimedia.org/T170462) (owner: Herron)
[07:32:12] !log updating ruthenium to nodejs 6.11
[07:32:14] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:13] Operations, Services (next), User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3442747 (MoritzMuehlenhoff) @Arlolra : ruthenium has been updated to 6.11
[07:36:10] (CR) Alexandros Kosiaris: "Indeed. I 've summarized this into https://phabricator.wikimedia.org/T170462" [puppet] - https://gerrit.wikimedia.org/r/364827 (https://phabricator.wikimedia.org/T170462) (owner: Herron)
[07:48:07] (CR) Muehlenhoff: [C: 1] "Looks fine, adding Guillaume as reviewer as well (who's currently on vacation), these are usually only merged during the fixed WDQS mainte" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/365518 (owner: Lucas Werkmeister (WMDE))
[07:51:41] (CR) Muehlenhoff: "The Hadoop bug is fixed upstream in mapreduce 2.9, but it's not packaged in CDH yet. This is tracked via T111433" [puppet] - https://gerrit.wikimedia.org/r/365120 (owner: Ottomata)
[07:52:03] (CR) Alexandros Kosiaris: [C: -1] Add sandboxing directives to wdqs-blazegraph.service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/365518 (owner: Lucas Werkmeister (WMDE))
[07:54:52] (CR) Alexandros Kosiaris: "Overall looks good, any chance we can squash it to its dependent patch since it hasn't been merged yet ?"
[puppet] - https://gerrit.wikimedia.org/r/365416 (owner: Thcipriani)
[07:55:33] !log upgrade nodejs to 6.11 on etherpad1001
[07:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:12] Operations, Services (next), User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3442776 (akosiaris)
[07:56:13] (CR) Giuseppe Lavagetto: [C: 2] Add coverage report [software/puppet-compiler] - https://gerrit.wikimedia.org/r/363350 (owner: Giuseppe Lavagetto)
[07:56:22] (CR) Giuseppe Lavagetto: [C: 2] Raise test coverage percentage [software/puppet-compiler] - https://gerrit.wikimedia.org/r/363351 (owner: Giuseppe Lavagetto)
[07:56:43] (CR) Giuseppe Lavagetto: [C: 2] Add future parser run mode [software/puppet-compiler] - https://gerrit.wikimedia.org/r/363808 (https://phabricator.wikimedia.org/T169546) (owner: Giuseppe Lavagetto)
[07:57:06] Operations, Services (next), User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3442792 (akosiaris) etherpad tested with 6.11 and etherpad1001 has been upgraded already.
[07:57:30] !log ema@neodymium conftool action : set/pooled=inactive; selector: name=wdqs1002.eqiad.wmnet
[07:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:36] (PS1) Giuseppe Lavagetto: puppet-compiler: bump to code version 0.2.0 [puppet] - https://gerrit.wikimedia.org/r/365555 (https://phabricator.wikimedia.org/T169546)
[08:04:00] (CR) Giuseppe Lavagetto: [V: 2 C: 2] puppet-compiler: bump to code version 0.2.0 [puppet] - https://gerrit.wikimedia.org/r/365555 (https://phabricator.wikimedia.org/T169546) (owner: Giuseppe Lavagetto)
[08:06:02] (CR) Smalyshev: "I would like to understand these directives before they are added.
And I'd like to test these settings on a VM before they are merged to p" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/365518 (owner: Lucas Werkmeister (WMDE))
[08:06:32] !log lvs2003: upgrade pybal to 1.13.9 T82747 T154759
[08:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:42] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747
[08:06:42] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759
[08:09:15] (CR) Alexandros Kosiaris: [C: -1] Add sandboxing directives to wdqs-blazegraph.service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/365518 (owner: Lucas Werkmeister (WMDE))
[08:13:48] (CR) Thibaut120094: [C: 1] Create 'rollbacker' user group in frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/365538 (https://phabricator.wikimedia.org/T170780) (owner: Framawiki)
[08:17:15] !log lvs100[39]: upgrade pybal to 1.13.9 T82747 T154759
[08:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:26] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747
[08:17:26] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759
[08:17:31] (CR) Smalyshev: Add sandboxing directives to wdqs-blazegraph.service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/365518 (owner: Lucas Werkmeister (WMDE))
[08:20:30] !log Increase expire_logs_days on db1069:3311 from 7 to 14 temporarily - T166204
[08:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:40] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204
[08:23:19] !log Deploy alter table s1 - labsdb1001 - T166204
[08:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:59] (PS1) Giuseppe Lavagetto:
role::configcluster: switch to future environment [puppet] - https://gerrit.wikimedia.org/r/365559
[08:28:57] !log reboot helium/heze for kernel upgrades
[08:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:15] akosiaris: lol I was using bconsole and everything froze, my heart stopped for a second :D
[08:30:34] PROBLEM - Host heze is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:44] PROBLEM - Host helium is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:58] elukey: XDDDDDDDDD
[08:31:24] RECOVERY - Host helium is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[08:32:04] RECOVERY - Host heze is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms
[08:32:47] ema: o/ - whenever you are done with the pybal upgrades can we schedule the reboots of conf1* hosts?
[08:34:08] elukey: we can proceed any time, I'm done with the low-traffic upgrades
[08:36:03] (CR) Alexandros Kosiaris: [C: -1] Add /data/ Redirect for commons (1 comment) [puppet] - https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: Ladsgroup)
[08:37:25] elukey: woops
[08:37:35] I 've checked that no jobs were running
[08:37:47] but not that others were not logged in
[08:38:26] akosiaris: no problem! I was just using bconsole for the first time and my attention level was probably over the roof D
[08:38:26] (CR) DCausse: [C: 1] "I think we should deploy this patch prior to deploying the patch in cirrus.
Otherwise we'll run without poolcounter protection for morelik" [mediawiki-config] - https://gerrit.wikimedia.org/r/365406 (https://phabricator.wikimedia.org/T170648) (owner: EBernhardson)
[08:38:28] :D
[08:39:58] (CR) Lucas Werkmeister (WMDE): Add sandboxing directives to wdqs-blazegraph.service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/365518 (owner: Lucas Werkmeister (WMDE))
[08:45:13] (PS2) Lucas Werkmeister (WMDE): Add sandboxing directives to wdqs-blazegraph.service [puppet] - https://gerrit.wikimedia.org/r/365518
[08:47:26] (PS2) Giuseppe Lavagetto: role::configcluster: switch to future environment [puppet] - https://gerrit.wikimedia.org/r/365559
[08:51:16] (PS3) Giuseppe Lavagetto: role::configcluster: switch to future environment [puppet] - https://gerrit.wikimedia.org/r/365559
[08:53:23] (CR) Alexandros Kosiaris: [C: 1] "+1ed on premise. Ping me when testing is done, I 'd be glad to merge." [puppet] - https://gerrit.wikimedia.org/r/365518 (owner: Lucas Werkmeister (WMDE))
[08:55:24] Operations, Cassandra, Services (blocked): Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735#3442853 (MoritzMuehlenhoff) a:MoritzMuehlenhoff>None
[08:56:09] Operations, Wikimedia-SVG-rendering, Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987#3442854 (MoritzMuehlenhoff) a:MoritzMuehlenhoff>None
[08:56:40] Operations, Patch-For-Review: Reimage/rename codfw pool counters - https://phabricator.wikimedia.org/T149298#3442856 (MoritzMuehlenhoff) This can be closed.
[08:56:44] Wotcha -operations, anyone about who is familiar with dumps generation?
Got a task T170741 created in response to an OTRS ticket
[08:56:45] T170741: siteinfo-namespaces.json - unexpected character at line 1 column 1 of the JSON data - https://phabricator.wikimedia.org/T170741
[08:58:22] TheresNoTime: apergos might be able to help you or point you in the right direction
[08:58:46] That'd be grand if they could :)
[09:03:30] * apergos peeks in
[09:03:52] Operations, Commons, Thumbor, Traffic, media-storage: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3442863 (Aklapper) @Jeff_G: Please always also describe what you see instead of only posting links. Those two links work for me and I have no ide...
[09:03:55] Operations, Patch-For-Review: Reimage/rename codfw pool counters - https://phabricator.wikimedia.org/T149298#3442864 (akosiaris) Open>Resolved
[09:04:16] (PS4) Giuseppe Lavagetto: role::configcluster: switch to future environment [puppet] - https://gerrit.wikimedia.org/r/365559
[09:04:58] huh that's special
[09:05:03] and new.
[09:05:07] !log Disable puppet on labsdb1009 for maintenance - T153743
[09:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:18] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[09:07:54] apergos: I'm guessing it's a dodgy API call during generation?
[09:08:54] (PS2) Jcrespo: Revert "mariadb: Depool db2062 to clone it to db2072, and other hosts" [mediawiki-config] - https://gerrit.wikimedia.org/r/365284
[09:09:50] I don't see it right off but I'll check into it
[09:10:00] and get it fixed up by the next run (starting on the 20th)
[09:10:13] Sounds great, thank you :)
[09:11:53] thanks for the report
[09:15:21] (PS5) Giuseppe Lavagetto: role::configcluster: switch to future environment [puppet] - https://gerrit.wikimedia.org/r/365559
[09:19:24] (PS3) Muehlenhoff: Restrict ores::web to domain networks [puppet] - https://gerrit.wikimedia.org/r/364185
[09:20:55] Operations, Traffic, netops: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3442897 (ayounsi) Another note, some of those spikes don't match with the OSPF flaps.
[09:21:48] (CR) Muehlenhoff: [C: 2] Restrict ores::web to domain networks [puppet] - https://gerrit.wikimedia.org/r/364185 (owner: Muehlenhoff)
[09:22:19] !log Stop replication on labsdb1009 and labsdb1010 for maintenance - T153743
[09:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:31] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[09:24:38] !log Disable puppet on labsdb1010 for maintenance - T153743
[09:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:03] Operations, netops, Patch-For-Review: deploy diffscan2 - https://phabricator.wikimedia.org/T169624#3442952 (ayounsi) Resolved>Open
[09:37:20] (PS1) ArielGlenn: fix construction of path for api calls [dumps] - https://gerrit.wikimedia.org/r/365564 (https://phabricator.wikimedia.org/T170741)
[09:41:14] (PS4) Alexandros Kosiaris: Puppetmaster profile: Support switching off active records [puppet] - https://gerrit.wikimedia.org/r/365257 (owner: Andrew Bogott)
[09:42:09] (PS3)
Jcrespo: Revert "mariadb: Depool db2062 to clone it to db2072, and other hosts" [mediawiki-config] - https://gerrit.wikimedia.org/r/365284
[09:42:15] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db2062 to clone it to db2072, and other hosts" [mediawiki-config] - https://gerrit.wikimedia.org/r/365284 (owner: Jcrespo)
[09:43:27] (Merged) jenkins-bot: Revert "mariadb: Depool db2062 to clone it to db2072, and other hosts" [mediawiki-config] - https://gerrit.wikimedia.org/r/365284 (owner: Jcrespo)
[09:43:44] (CR) jenkins-bot: Revert "mariadb: Depool db2062 to clone it to db2072, and other hosts" [mediawiki-config] - https://gerrit.wikimedia.org/r/365284 (owner: Jcrespo)
[09:44:20] TheresNoTime: see ticket updates.
[09:47:33] apergos: Oh how annoying! Well at least that's resolved, thank you
[09:48:06] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2062 (duration: 00m 47s)
[09:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:39] Operations, netops: pfw-codfw still logging to indium - https://phabricator.wikimedia.org/T170622#3443015 (ayounsi) Open>Resolved a:ayounsi Done! Reopen if any issues.
[09:53:33] (CR) Alexandros Kosiaris: [C: 2] Puppetmaster profile: Support switching off active records [puppet] - https://gerrit.wikimedia.org/r/365257 (owner: Andrew Bogott)
[09:53:40] yw!
[09:53:54] (CR) Alexandros Kosiaris: "PCC noop at https://puppet-compiler.wmflabs.org/compiler02/7071/" [puppet] - https://gerrit.wikimedia.org/r/365257 (owner: Andrew Bogott)
[09:58:25] Operations, Citoid, VisualEditor, Services (blocked), User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3443024 (Samwalton9) I'm told that Wiley "should be able to provide the necessary info this week."
[09:58:35] (CR) Alexandros Kosiaris: [C: 2] Adds aspell-el to ORES base.pp [puppet] - https://gerrit.wikimedia.org/r/365289 (https://phabricator.wikimedia.org/T170709) (owner: Halfak)
[09:58:41] (PS3) Alexandros Kosiaris: Adds aspell-el to ORES base.pp [puppet] - https://gerrit.wikimedia.org/r/365289 (https://phabricator.wikimedia.org/T170709) (owner: Halfak)
[09:58:46] (CR) Alexandros Kosiaris: [V: 2 C: 2] Adds aspell-el to ORES base.pp [puppet] - https://gerrit.wikimedia.org/r/365289 (https://phabricator.wikimedia.org/T170709) (owner: Halfak)
[09:59:56] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/365567
[10:01:23] !log installing apache updates on otrs.wikimedia.org
[10:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:56] !log installing apache updates on logstash
[10:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:41] (PS2) ArielGlenn: write out a list of special dump files per dump run that downloaders may want [dumps] - https://gerrit.wikimedia.org/r/364729 (https://phabricator.wikimedia.org/T169849)
[10:05:05] (CR) jerkins-bot: [V: -1] write out a list of special dump files per dump run that downloaders may want [dumps] - https://gerrit.wikimedia.org/r/364729 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[10:10:38] Puppet, Cloud-Services, Continuous-Integration-Infrastructure, Beta-Cluster-reproducible: New instances attached to a role::puppetmaster::standalone Puppetmaster need manual changes after switching from the default Puppetmaster - https://phabricator.wikimedia.org/T148929#3443054 (hashar) Thank yo...
[10:16:19] !log installing apache updates on graphite hosts
[10:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:27] (PS1) Giuseppe Lavagetto: motd::script: use validate_numeric for priority [puppet] - https://gerrit.wikimedia.org/r/365569
[10:28:29] (PS1) Giuseppe Lavagetto: rsyslog::conf: validate priority with validate_numeric [puppet] - https://gerrit.wikimedia.org/r/365570
[10:28:31] (PS1) Giuseppe Lavagetto: sysctl::conffile: validate priority as numeric [puppet] - https://gerrit.wikimedia.org/r/365571
[10:28:33] (PS1) Giuseppe Lavagetto: role::configcluster: move to future environment [puppet] - https://gerrit.wikimedia.org/r/365572
[10:28:35] (CR) Faidon Liambotis: Puppetmaster profile: Support switching off active records (1 comment) [puppet] - https://gerrit.wikimedia.org/r/365257 (owner: Andrew Bogott)
[10:29:09] akosiaris: ^
[10:29:12] since you merged it :)
[10:36:13] Operations, Citoid, VisualEditor, Services (blocked), User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3443097 (Mvolz) Thanks for chasing this down @Samwalton9!
[10:50:09] (PS4) Muehlenhoff: Tighten access to zookeeper [puppet] - https://gerrit.wikimedia.org/r/356548 (https://phabricator.wikimedia.org/T114815)
[10:51:11] Hey, I want my accounts back, and ready for video session if needed. in Wikimedia Armenia office right now
[10:59:32] Amir1: o/ o/
[10:59:38] Hey!
[10:59:44] nice to see you online :)
[10:59:54] Thanks :)
[11:00:26] maybe it is better to open a phab task?
[11:00:32] not sure how to handle these cases
[11:01:43] okay
[11:04:09] (CR) Muehlenhoff: [C: 1] "Two minor issues, but looks good to me!"
(2 comments) [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: DCausse)
[11:08:34] (PS1) Giuseppe Lavagetto: Add base to make_diff [software/puppet-compiler] - https://gerrit.wikimedia.org/r/365577
[11:11:07] (PS2) Giuseppe Lavagetto: Add base to make_diff [software/puppet-compiler] - https://gerrit.wikimedia.org/r/365577
[11:11:08] elukey: https://phabricator.wikimedia.org/T170801
[11:11:21] (security, obviously)
[11:12:44] (CR) Giuseppe Lavagetto: [C: 2] Add base to make_diff [software/puppet-compiler] - https://gerrit.wikimedia.org/r/365577 (owner: Giuseppe Lavagetto)
[11:14:57] Amir1: super
[11:15:02] (PS1) Giuseppe Lavagetto: puppet-compiler: bump to 0.2.1 [puppet] - https://gerrit.wikimedia.org/r/365578
[11:15:04] (PS1) Giuseppe Lavagetto: utils/pcc: add --future argument [puppet] - https://gerrit.wikimedia.org/r/365579
[11:16:43] (CR) Giuseppe Lavagetto: [C: 2] puppet-compiler: bump to 0.2.1 [puppet] - https://gerrit.wikimedia.org/r/365578 (owner: Giuseppe Lavagetto)
[11:19:33] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[11:19:34] PROBLEM - WDQS HTTP on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.001 second response time
[11:19:43] PROBLEM - WDQS SPARQL on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.001 second response time
[11:29:46] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/365567
[11:31:43] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/365567 (owner: Marostegui)
[11:32:48] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/365567 (owner: Marostegui)
[11:32:59] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/365567 (owner: Marostegui)
[11:33:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 - T168661 (duration: 00m 46s)
[11:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:00] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[11:35:28] !log Deploy alter table on s4 - dbstore1002 - T168661
[11:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:37] (CR) Elukey: [C: 1] kafkatee: send 4xx to logstash as well [puppet] - https://gerrit.wikimedia.org/r/365247 (owner: Filippo Giunchedi)
[12:00:21] (CR) Elukey: [C: 1] "Yes please! :)" [puppet] - https://gerrit.wikimedia.org/r/365228 (owner: Jcrespo)
[12:23:39] (CR) Volans: [C: 2] Query and grammar: add support for aliases [software/cumin] - https://gerrit.wikimedia.org/r/363748 (https://phabricator.wikimedia.org/T169640) (owner: Volans)
[12:25:51] (Merged) jenkins-bot: Query and grammar: add support for aliases [software/cumin] - https://gerrit.wikimedia.org/r/363748 (https://phabricator.wikimedia.org/T169640) (owner: Volans)
[12:26:18] (CR) Volans: [C: 2] QueryBuilder: fix subgroup close at the end of query [software/cumin] - https://gerrit.wikimedia.org/r/363749 (owner: Volans)
[12:28:34] (CR) Volans: "Not yet (the switchdc patch) because there are other breaking changes coming very soon and I'd rather do them all together when doing a re" [software/cumin] - https://gerrit.wikimedia.org/r/363750 (owner: Volans)
[12:29:12] (Merged) jenkins-bot: QueryBuilder: fix subgroup close at the end of query [software/cumin] - https://gerrit.wikimedia.org/r/363749 (owner: Volans)
[12:39:26] (PS8) DCausse: Switch this repo to a deb package [software/elasticsearch/plugins] -
10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [12:40:09] (03PS3) 10ArielGlenn: write out a list of special dump files per dump run that downloaders may want [dumps] - 10https://gerrit.wikimedia.org/r/364729 (https://phabricator.wikimedia.org/T169849) [12:40:20] (03CR) 10DCausse: Switch this repo to a deb package (032 comments) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [12:40:30] (03CR) 10jerkins-bot: [V: 04-1] write out a list of special dump files per dump run that downloaders may want [dumps] - 10https://gerrit.wikimedia.org/r/364729 (https://phabricator.wikimedia.org/T169849) (owner: 10ArielGlenn) [12:42:10] (03PS1) 10ArielGlenn: make sure page range explain query can never be for 0 pages [dumps] - 10https://gerrit.wikimedia.org/r/365585 [12:49:17] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [13:00:06] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170717T1300). Please do the needful. [13:00:06] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:18] I can SWAT today! [13:01:38] Urbanecm: around for SWAT? 
[13:01:42] * zeljkof is reviewing 365084 [13:02:11] (03PS1) 10Gilles: Serve a synth error page when error body is empty in Varnish [puppet] - 10https://gerrit.wikimedia.org/r/365589 (https://phabricator.wikimedia.org/T169683) [13:03:49] 10Operations, 10Huggle, 10Wikimedia-Mailing-lists: Huggle mailing list - need to regain owner access - https://phabricator.wikimedia.org/T170803#3443323 (10Luke081515) [13:04:39] zeljkof: he is not online currently :/ [13:04:42] (03PS1) 10Elukey: profile::piwik::backup: use quotes for the backup's password field [puppet] - 10https://gerrit.wikimedia.org/r/365590 (https://phabricator.wikimedia.org/T164073) [13:05:46] Sagan: um, then no deployment? [13:06:08] zeljkof: maybe he will arrive later during the SWAT hour? sometimes that's the case [13:06:11] because his commits are the only commits for deployment today [13:06:31] makes sense, will review everything and be ready if he comes [13:07:55] (03CR) 10Elukey: [C: 032] profile::piwik::backup: use quotes for the backup's password field [puppet] - 10https://gerrit.wikimedia.org/r/365590 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [13:08:19] (03CR) 10Zfilipin: [C: 031] Provide HD logos for several Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365084 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:09:33] (03CR) 10Luke081515: [C: 031] Allow uploads to autoconfirmed-only at huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365288 (https://phabricator.wikimedia.org/T169438) (owner: 10Urbanecm) [13:09:56] zeljkof: in case you want, we can start with that change, that is easy enough that I can take care of it, in case you would SWAT it :) [13:10:07] (https://gerrit.wikimedia.org/r/365288) [13:10:51] Sagan: sure, let me just finish the current review [13:10:55] ok :) [13:11:54] zeljkof, I'm here now :) [13:11:56] Urbanecm: you are a bit late :D [13:12:05] Urbanecm: welcome! 
[13:12:12] Sagan, yeah, some internet troubles :) [13:12:14] Sorry! [13:12:33] Urbanecm: so if you need a bit more time, we can start with https://gerrit.wikimedia.org/r/365288 as I offered, then I'd take care, or you can take over [13:12:37] just reviewing your patches, ok, will merge 365084 for start [13:12:58] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365084 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:13:01] Sagan, I can do everything needful now :) [13:13:10] Urbanecm: ok, then it's up to you now :) [13:14:00] Hello [13:14:05] Hi Dereckson [13:14:07] need help to review something? [13:14:08] (03Merged) 10jenkins-bot: Provide HD logos for several Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365084 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:16:13] (03CR) 10jenkins-bot: Provide HD logos for several Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365084 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:17:49] 10Operations: Frequent segfaults of rsvg-convert on image scalers - https://phabricator.wikimedia.org/T137876#3443394 (10Gilles) [13:18:36] Urbanecm: can you test 365084 at mwdebug? or should I do the full scap immediately? [13:18:58] zeljkof: Yes, I can test it there [13:19:08] which mwdebug this time? :) [13:19:12] Urbanecm: great, will be there in a minute, will ping you [13:19:25] Sagan, almost every path is with mwdebug ;) [13:19:35] Sagan: always the same one :) mwdebug1002 [13:19:55] Urbanecm: yeah, but last time we had a funny discussion which of those two. and it's funny if two people ask the exact same question in the same second :D [13:20:34] Sagan: mwdebug1001 is intended to be used by developers and automated testing processes, 1002 to be used for deployment [13:21:34] Dereckson: ah, ok. The last time I submitted a patch for SWAT was long time ago. At codfw, which ones is used there for deployment? 
2017? [13:22:08] yeah, during datacenter switch, if you need write access to the database to test your patch, a codfw server is to be used instead [13:22:48] (03CR) 10Luke081515: [C: 031] Create 'rollbacker' user group in frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365538 (https://phabricator.wikimedia.org/T170780) (owner: 10Framawiki) [13:24:18] Urbanecm: 365084 is at mwdebug1002, please test [13:24:38] zeljkof, testing [13:29:48] (03CR) 10Zfilipin: [C: 031] Provide HD logos for several Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365086 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:29:53] zeljkof, working, please deploy [13:30:36] Urbanecm: great, will do [13:30:52] (03PS4) 10Urbanecm: Provide HD logos for several Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365086 (https://phabricator.wikimedia.org/T150618) [13:34:09] ugh, connection dropped for a minute, everything frozen, will reconnect and deploy in a minute [13:34:34] (03PS1) 10Elukey: profile::piwik::backup: remove duplicate old backup clean cron [puppet] - 10https://gerrit.wikimedia.org/r/365595 (https://phabricator.wikimedia.org/T164073) [13:35:07] (03PS2) 10Giuseppe Lavagetto: role::configcluster: move to future environment [puppet] - 10https://gerrit.wikimedia.org/r/365572 [13:35:28] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Initial OpenStack Neutron PoC deployment in Labtest - https://phabricator.wikimedia.org/T153099#3443553 (10chasemp) 05Open>03Resolved Resolving in favor of T167559 which will have more details and hopefully some authoritative plan. I don't... 
[13:36:21] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:365084|Provide HD logos for several Wikipedias (T150618)]] (duration: 00m 48s) [13:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:33] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [13:37:18] !log zfilipin@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [13:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:01] Urbanecm: uh oh [13:38:06] 13:37:17 Check 'Logstash Error rate for mw1276.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.03, After: 2.00, Threshold: 1.00) [13:38:08] zeljkof, what happened? [13:38:08] looking [13:39:03] 13:37:18 sync-file failed: scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [13:39:31] zeljkof, I don't understand what's the matter. Do we have to rollback? Is anything broken? [13:39:33] (03CR) 10Elukey: [C: 032] profile::piwik::backup: remove duplicate old backup clean cron [puppet] - 10https://gerrit.wikimedia.org/r/365595 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [13:40:29] Urbanecm: looking... 
[13:40:43] Ok [13:42:03] (03PS3) 10Giuseppe Lavagetto: role::configcluster: move to future environment [puppet] - 10https://gerrit.wikimedia.org/r/365572 [13:42:13] the only thing that comes to my mind is that a logo is used in the config but not uploaded, but since there are so many of them, not sure how to quickly check [13:42:18] Urbanecm: ^ [13:42:27] nothing in the logs stands out for me [13:43:10] !log Deploy alter table on s4 - dbstore1001 - T168661 [13:43:12] zeljkof, I've written a simple bash script that tries to connect to mwdebug1002, download all logos and report if there was any 404. Can it help? [13:43:19] hashar (or anybody): any idea what is wrong? [13:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:22] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [13:43:22] 13:37:18 sync-file failed: scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [13:43:33] zeljkof: which canary was it? [13:43:39] Urbanecm: can you run it? [13:43:47] zeljkof, sure. Mwdebug or prod? [13:43:55] thcipriani|afk: 13:37:17 Check 'Logstash Error rate for mw1276.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.03, After: 2.00, Threshold: 1.00) [13:44:21] zeljkof: an after error rate of 2 shouldn't be anything to worry about if it was only one canary [13:44:44] zeljkof: try resyncing...I need to fiddle with the logic of the error rate checker, me thinks [13:44:45] thcipriani|afk: so --force? 
[13:44:54] try running it without force at first [13:44:56] thcipriani|afk: ok, will do [13:45:28] thcipriani|afk: just one canary, as far as I can see: [13:45:31] RuntimeError: scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [13:46:30] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:365084|Provide HD logos for several Wikipedias (T150618)]] (duration: 00m 46s) [13:46:35] yeah, I think that's what I'll fiddle with, if the error rate increases by 10x on > 1/4 of the canaries: something like that [13:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:41] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [13:46:56] Urbanecm, thcipriani|afk: all good this time, no errors [13:46:59] thcipriani|afk: thanks! [13:47:21] zeljkof: yw. sorry for the added noise, but thanks for being cautious :) [13:47:27] Great. I was able to download all 120 logos I've uploaded in the patch. Maybe I should do smaller patches in the future? [13:48:05] Urbanecm: small patches are good patches :) but I don't know, what ever makes sense [13:48:27] thcipriani|afk: scap errors really scare me :) [13:48:36] Urbanecm: continuing with 365086 [13:48:39] as well they should :) [13:49:04] zeljkof, I just don't know the correct number of logos :) [13:49:48] Urbanecm: well, that's hard to tell [13:50:30] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365086 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:50:53] PROBLEM - puppet last run on labtestservices2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:52:22] (03Merged) 10jenkins-bot: Provide HD logos for several Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365086 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:52:34] (03CR) 10jenkins-bot: Provide HD logos for several Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365086 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:53:27] (03PS4) 10Urbanecm: Add HD logos for several Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365092 (https://phabricator.wikimedia.org/T150618) [13:54:35] Urbanecm: 365086 is at mwdebug1002, please test [13:54:44] zeljkof, testing [13:54:53] RECOVERY - puppet last run on labtestservices2002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:55:21] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3443582 (10elukey) Fixed the predump script that wasn't able to backup to `/srv/backup` on bohrium, now everything should be ok.... [13:55:56] 10Operations, 10Performance-Team, 10Thumbor: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3443583 (10Gilles) [13:56:40] (03PS1) 10Ottomata: Install /usr/bin/time on stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/365599 (https://phabricator.wikimedia.org/T170472) [13:57:04] (03PS2) 10Faidon Liambotis: scs-oe11-esams had old dns name of scs1 [dns] - 10https://gerrit.wikimedia.org/r/365079 (owner: 10RobH) [13:57:06] zeljkof, working, please deploy [13:57:17] Urbanecm: deploying... 
[13:57:23] (03CR) 10Faidon Liambotis: [C: 032] scs-oe11-esams had old dns name of scs1 [dns] - 10https://gerrit.wikimedia.org/r/365079 (owner: 10RobH) [13:57:44] (03CR) 10jerkins-bot: [V: 04-1] Install /usr/bin/time on stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/365599 (https://phabricator.wikimedia.org/T170472) (owner: 10Ottomata) [13:57:58] (03Abandoned) 10Ema: Release pybal 1.13.6 [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [13:58:24] !log zfilipin@tin Synchronized static/images/project-logos: SWAT: [[gerrit:365086|Provide HD logos for several Wikiversities (T150618)]] (duration: 00m 47s) [13:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:35] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [13:59:39] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:365086|Provide HD logos for several Wikiversities (T150618)]] (duration: 00m 46s) [13:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:51] Urbanecm: deployed, please check [13:59:55] zeljkof, checking [14:00:39] Urbanecm: should I continue with the rest of the patches? can you stay longer? or would you like to move the remaining patches to another window? [14:01:01] zeljkof, it'll be great if you can continue! 
[14:01:24] zeljkof, by the way, it's working [14:01:24] (03CR) 10Ottomata: [V: 032 C: 032] "Dunno why jenkins is complaining about " two-space soft tabs not used"" [puppet] - 10https://gerrit.wikimedia.org/r/365599 (https://phabricator.wikimedia.org/T170472) (owner: 10Ottomata) [14:01:47] ottomata: don't override jenkins like that [14:01:56] it'll now V-1 every subsequent commit [14:02:06] Urbanecm: ok, continuing then [14:02:18] !log Extending EU SWAT [14:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:58] (03PS4) 10Ottomata: Decom RCStream [puppet] - 10https://gerrit.wikimedia.org/r/364219 (https://phabricator.wikimedia.org/T170157) [14:03:14] (03PS1) 10Andrew Bogott: puppetmaster: Add an additional option, 'thin' for profile::puppetmaster::common::storeconfigs: [puppet] - 10https://gerrit.wikimedia.org/r/365600 [14:04:06] (03PS1) 10Faidon Liambotis: statistics: fix indentation (and a lint failure) [puppet] - 10https://gerrit.wikimedia.org/r/365601 [14:04:15] ottomata: ^ [14:04:25] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: Add an additional option, 'thin' for profile::puppetmaster::common::storeconfigs: [puppet] - 10https://gerrit.wikimedia.org/r/365600 (owner: 10Andrew Bogott) [14:04:29] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365092 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [14:04:52] (03Draft1) 10Paladox: Fix puppet lint [puppet] - 10https://gerrit.wikimedia.org/r/365602 [14:04:57] (03PS5) 10Elukey: Tighten access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/356548 (https://phabricator.wikimedia.org/T114815) (owner: 10Muehlenhoff) [14:04:59] (03PS2) 10Paladox: Fix puppet lint [puppet] - 10https://gerrit.wikimedia.org/r/365602 [14:05:03] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3443667 (10jcrespo) FYI, 
we uniformized not a long time ago local dumps going to /srv/backups on database hosts, in case you wan... [14:05:05] (03CR) 10jerkins-bot: [V: 04-1] statistics: fix indentation (and a lint failure) [puppet] - 10https://gerrit.wikimedia.org/r/365601 (owner: 10Faidon Liambotis) [14:05:07] (03Abandoned) 10Paladox: Fix puppet lint [puppet] - 10https://gerrit.wikimedia.org/r/365602 (owner: 10Paladox) [14:05:26] (03Merged) 10jenkins-bot: Add HD logos for several Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365092 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [14:05:39] ah, thanks paravoid, i retabbed a differrent sectino that i thought it was complaining about and it didn't change anything, guess i missed that [14:05:39] thanks [14:05:52] ha, jenkins still not happy though [14:05:54] (03PS2) 10Faidon Liambotis: statistics: fix indentation (and a lint failure) [puppet] - 10https://gerrit.wikimedia.org/r/365601 [14:05:57] ottomata: it shows you the line number on the error [14:06:04] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3443674 (10elukey) @jcrespo sorry it is `/srv/backups`, missed a 's' in my last post :( [14:06:11] (03CR) 10jenkins-bot: Add HD logos for several Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365092 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [14:06:24] ya, paravoid i must have grabbed the wrong line number [14:06:27] mobrovac: hello! You there ? 
[14:07:02] (03CR) 10Faidon Liambotis: [C: 032] statistics: fix indentation (and a lint failure) [puppet] - 10https://gerrit.wikimedia.org/r/365601 (owner: 10Faidon Liambotis) [14:07:13] ottomata: jenkins is happy now [14:07:40] ottomata: but yeah, don't override jenkins like that, someone probably did that when that change was introduced too, and you're puzzled by it now who knows how many months/years later :) [14:08:03] (03PS2) 10Ema: Linux kernel module handling [puppet] - 10https://gerrit.wikimedia.org/r/365030 [14:08:09] Urbanecm: 365092 is at mwdebug1002, please test [14:08:31] k [14:08:55] zeljkof, testing [14:09:23] ottomata: well that someone was you apparently, and it was 5 days ago :P [14:09:24] (03CR) 10jerkins-bot: [V: 04-1] Linux kernel module handling [puppet] - 10https://gerrit.wikimedia.org/r/365030 (owner: 10Ema) [14:09:35] ottomata: https://gerrit.wikimedia.org/r/#/c/364773/ + https://gerrit.wikimedia.org/r/#/c/364782/ [14:09:47] paravoid: there was a moment when i was doing lots of stat box refactoring, but jenkins wasn't working, or taking many many minutes [14:09:49] usually i wait [14:09:58] this time i saw the error, but wrongfully pushed it through [14:10:01] so ya my fault [14:10:01] yeah looks like it never V+anything for those changes [14:10:23] zeljkof, working, deploy please [14:10:32] Urbanecm: deploying... 
[14:11:00] !log decommissioning rcs100[12] to spare::system: T170157 [14:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:10] T170157: Decommission RCStream - https://phabricator.wikimedia.org/T170157 [14:11:23] paper flipped [14:11:32] zeljkof: I am back around [14:11:54] hashar: situation normal, extended swat, more commits to deploy [14:11:56] (03PS2) 10Urbanecm: Run optipng -o7 at all PNGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365023 (https://phabricator.wikimedia.org/T170569) [14:11:56] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:365092|Add HD logos for several Wiktionaries (T150618)]] (duration: 00m 49s) [14:12:01] (03PS5) 10Ottomata: Decom RCStream [puppet] - 10https://gerrit.wikimedia.org/r/364219 (https://phabricator.wikimedia.org/T170157) [14:12:05] some trouble, but thcipriani|afk was around to help [14:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:07] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [14:12:15] (03CR) 10Ottomata: [V: 032 C: 032] Decom RCStream [puppet] - 10https://gerrit.wikimedia.org/r/364219 (https://phabricator.wikimedia.org/T170157) (owner: 10Ottomata) [14:12:38] (03CR) 10Ema: Linux kernel module handling (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/365030 (owner: 10Ema) [14:13:12] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:365092|Add HD logos for several Wiktionaries (T150618)]] (duration: 00m 46s) [14:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:40] Urbanecm: 365092 deployed, please check [14:13:45] ema: ah right, silly me (re: lsmod) [14:13:55] zeljkof, checking [14:13:58] ema: jenkins is complaining, I think due to indentation failures [14:14:05] ema: but looks great overall! [14:14:16] paravoid: nice! 
[14:15:10] I wonder if blacklist/options needs a notify => Exec['update-initramfs'], but we currently don't do that, so it could be part of another commit [14:15:39] paravoid: oh, good point, somewhere we do notify update-initramfs IIRC [14:15:58] lvs::kernel_config for instance [14:16:05] that only matters for modules that would be loaded in early boot [14:16:10] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365023 (https://phabricator.wikimedia.org/T170569) (owner: 10Urbanecm) [14:17:06] zeljkof, where is https://en.wikipedia.org/static/images/project-logos/arwiktionary-2x.png for example? It is 404 for me [14:17:12] (ad https://gerrit.wikimedia.org/r/#/c/365092/) [14:17:21] (03Merged) 10jenkins-bot: Run optipng -o7 at all PNGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365023 (https://phabricator.wikimedia.org/T170569) (owner: 10Urbanecm) [14:17:32] (03CR) 10jenkins-bot: Run optipng -o7 at all PNGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365023 (https://phabricator.wikimedia.org/T170569) (owner: 10Urbanecm) [14:17:38] ema: <3 for doing this right! (validate_*, documentation etc.) [14:18:09] (03PS3) 10Ema: Linux kernel module handling [puppet] - 10https://gerrit.wikimedia.org/r/365030 [14:18:21] Urbanecm: same here :/ [14:18:29] zeljkof, what's same? [14:18:37] 404 [14:18:47] It was working on mwdebug... [14:19:02] And it is in the patch. Do you know how we can check what happened? [14:19:21] https://en.wikipedia.org/static/images/project-logos/arwiktionary-2x.png is on mwdebug1001 at least [14:19:46] and on prod as well, but looks like the logo URL has to be purged [14:20:10] hashar, that's true, with wgetting all URLs using X-Wikimedia-Debug header, everything works. But at prod ten logos disappeared. 
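(Editor's sketch, not part of the log.) The check Urbanecm describes, fetching every logo URL and reporting the ones that return 404, might look roughly like the following. His actual bash script is not shown in the log, so the names `http_status` and `find_missing` are hypothetical, and the exact value format expected by the `X-Wikimedia-Debug` header is an assumption left to the caller.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional


def http_status(url: str, debug_header: Optional[str] = None) -> int:
    """Return the HTTP status code for a HEAD request to `url`.

    `debug_header`, if given, is sent as X-Wikimedia-Debug; its value
    format is an assumption and must be supplied by the caller.
    """
    request = urllib.request.Request(url, method="HEAD")
    if debug_header is not None:
        request.add_header("X-Wikimedia-Debug", debug_header)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


def find_missing(urls: Iterable[str],
                 fetch: Callable[[str], int] = http_status) -> List[str]:
    """Return the URLs for which `fetch` reports a 404, in input order."""
    return [url for url in urls if fetch(url) == 404]
```

Running `find_missing` once with the debug header and once without it would separate "missing on the backend" from "404 cached at the edge", which is the distinction the conversation arrives at.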
[14:20:13] 2017-07-17 14:19 Synchronized static/images/project-logos/: SWAT: [[gerrit:365092|Add HD logos for several Wiktionaries (T150618)]] (duration: 00m 49s) [14:20:13] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [14:20:21] mwscript purgeList.php --wiki=enwiki would do it I guess [14:20:30] hashar: should I run it? [14:20:35] yup [14:20:39] running [14:20:47] (03PS1) 10Andrew Bogott: Puppetmaster: Don't use storeconfig_thin. [puppet] - 10https://gerrit.wikimedia.org/r/365605 [14:20:50] then enter the above URL and Ctrl + D to finish [14:20:50] (03Abandoned) 10Andrew Bogott: puppetmaster: Add an additional option, 'thin' for profile::puppetmaster::common::storeconfigs: [puppet] - 10https://gerrit.wikimedia.org/r/365600 (owner: 10Andrew Bogott) [14:21:16] zeljkof: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Image_Cache_Purges :) [14:21:37] and eventually we could have scap to purge them automatically [14:21:46] paravoid: thanks for the patient review :) Jenkins seems happy now that the arrows are aligned. [14:21:57] hashar: just reading it [14:22:06] so, I have to run it for every logo? [14:23:35] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests, 10Patch-For-Review: Decommission RCStream - https://phabricator.wikimedia.org/T170157#3443765 (10Ottomata) [14:23:55] Urbanecm: this now works https://en.wikipedia.org/static/images/project-logos/arwiktionary-2x.png [14:24:16] do I need to run purgeList for more logos? [14:24:19] yes [14:24:42] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests, 10Patch-For-Review: Decommission RCStream - https://phabricator.wikimedia.org/T170157#3443768 (10Ottomata) @Robh, I started the decommission process for rcs1001 and rcs1002, but may have taken it farther than I should. I followed...
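(Editor's sketch, not part of the log.) Since purgeList.php is described above as taking URLs typed in and ended with Ctrl + D, one way to avoid entering each logo by hand is to generate the list, one URL per line, and feed it in. The base URL is taken from the arwiktionary example in the log; the function names and the choice to sort the file names are illustrative assumptions.

```python
import os
from typing import Iterable, List

# Base path taken from the logo URL quoted in the conversation.
LOGO_BASE_URL = "https://en.wikipedia.org/static/images/project-logos/"


def purge_urls(filenames: Iterable[str], base: str = LOGO_BASE_URL) -> List[str]:
    """Turn logo file names into the public URLs that need purging."""
    return [base + name for name in sorted(filenames)]


def purge_input_for_dir(directory: str) -> str:
    """Build newline-separated purge input from every file in `directory`."""
    return "\n".join(purge_urls(os.listdir(directory))) + "\n"
```

The string returned by `purge_input_for_dir("static/images/project-logos")` could then be piped to the purge script from a checkout of the config repository; that directory path comes from the sync messages in the log.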
[14:25:00] hashar: "yes" as in "you have to run it for every logo" :) [14:25:18] zeljkof: yeah you can generate the list of URL based on the list of files changed in the last few patches [14:25:23] or just blindly purge all logos [14:25:35] hashar: how do I do it for all logos? [14:25:44] since the change was pretty big [14:25:55] zeljkof, at https://pastebin.com/ZMWhpB38 there are URLs for all logos I've changed [14:26:02] (03CR) 10Faidon Liambotis: [C: 032] "Excellent! Ideas for future enhancements:" [puppet] - 10https://gerrit.wikimedia.org/r/365030 (owner: 10Ema) [14:26:15] ls -1 static/images/project-logos|sed -e 's%^%https://en.wikipedia.org/static/images/project-logos/%' [14:26:22] zeljkof: ^ ^ that should do it [14:26:34] hashar, at https://pastebin.com/ZMWhpB38 there are URLs for logos I've changed ;) [14:26:43] hashar: trying that [14:26:55] then pipe that to mwscript purgeList.php --wiki=enwiki [14:27:05] or yeah that pastebin [14:27:24] hashar: can we add a linter for common commit message pitfalls? [14:27:37] zeljkof: use raw, https://pastebin.com/raw/ZMWhpB38 [14:27:38] e.g. periods in the first line, > 78 chars in the first line etc. [14:27:40] (03PS1) 10Andrew Bogott: Nodepool: move 'rate' to 6 seconds. [puppet] - 10https://gerrit.wikimedia.org/r/365608 (https://phabricator.wikimedia.org/T170492) [14:27:54] paravoid: there is some commit message validator written in python floating around. IIRC that is used by VisualEditor [14:27:58] (03PS2) 10Faidon Liambotis: puppetmaster: Don't use storeconfig_thin [puppet] - 10https://gerrit.wikimedia.org/r/365605 (owner: 10Andrew Bogott) [14:28:09] andrewbogott: ^^^ I fixed your commit message [14:28:37] andrewbogott: s/Puppetmaster/puppetmaster/;s/\.$//;s/ / /, plus proper line wrapping length in the body [14:28:47] your nodepool one needs a similar fix [14:28:52] ok, thanks [14:29:00] hashar: can we enabled that in ops/puppet? 
:) [14:29:04] s/enabled/enable/ [14:29:21] paravoid: https://pypi.python.org/pypi/commit-message-validator !! [14:29:23] (03PS2) 10Andrew Bogott: nodepool: move 'rate' to 6 seconds [puppet] - 10https://gerrit.wikimedia.org/r/365608 (https://phabricator.wikimedia.org/T170492) [14:29:27] (03PS9) 10DCausse: Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [14:29:32] paravoid: so we could probably add it to the tox/python part [14:29:44] andrewbogott: still two spaces between "nodepool:" and "move" [14:30:02] (03PS3) 10Andrew Bogott: nodepool: move 'rate' to 6 seconds [puppet] - 10https://gerrit.wikimedia.org/r/365608 (https://phabricator.wikimedia.org/T170492) [14:30:26] hashar, Urbanecm: this run purgeList for _every_ logo is a mess [14:30:33] :/ [14:30:53] zeljkof, don't understand. I've a wget log so I can provide only 404 logos if you wish ;) [14:31:12] Urbanecm: not sure what to do [14:31:40] this is the third commit with logos, and there is at least one more, should we run purgeList for every logo? [14:31:47] from all commits [14:31:48] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [14:31:56] or just ones returning 404 [14:32:05] zeljkof, not at all. 
Just logos that returns 404 [14:32:22] can you send me the list of 404s then [14:32:25] please [14:32:30] so I will run the script for them [14:32:48] 10Puppet, 10ORES, 10Scoring-platform-team: Add greek dict to ores puppet base - https://phabricator.wikimedia.org/T170709#3440098 (10Halfak) 05Open>03Resolved [14:32:53] (03CR) 10Andrew Bogott: [C: 032] nodepool: move 'rate' to 6 seconds [puppet] - 10https://gerrit.wikimedia.org/r/365608 (https://phabricator.wikimedia.org/T170492) (owner: 10Andrew Bogott) [14:32:58] (03PS1) 10Hashar: Introduce commit-message-validator for nice commit messages [puppet] - 10https://gerrit.wikimedia.org/r/365609 [14:34:00] Urbanecm: can you send me the list of 404s please? [14:34:06] zeljkof, yes [14:34:31] !log Run maintain-views on labsdb1009,10 and 11 for s6 - T153743 [14:34:35] !log changing nodepool rate to '6' and restarting nodepool [14:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:43] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [14:34:49] zeljkof, https://pastebin.com/raw/KHSWvapY [14:34:50] andrewbogott: no need to restart nodepool :) [14:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:01] andrewbogott: it catch up with the new config automatically [14:35:08] Urbanecm: great, thanks, running the script [14:35:08] hashar: oh? does it reload config automatically? [14:35:14] please check the previous commits too [14:35:19] Anyway, I restarted it :( [14:35:22] andrewbogott: yeah it should. I am not sure how often, but fast enough usually. [14:35:25] zeljkof, I checked them already [14:35:30] they are working [14:35:36] andrewbogott: no worries. 
It should catch up just fine (tm) [14:35:36] Urbanecm: great, thanks [14:36:12] yw [14:36:38] PROBLEM - mysqld processes on db1102 is CRITICAL: PROCS CRITICAL: 3 processes with command name mysqld [14:36:51] Oh no, lost downtime? :( [14:36:59] volans akosiaris ^ :( [14:37:25] Urbanecm: please check all logos again [14:37:25] let me check the logs to make sure it was downtimed for longer [14:37:26] <_joe_> ouch [14:37:44] zeljkof, checking [14:38:23] Urbanecm: continuing with 365023 [14:39:23] !log Created account "Biplab Anand" at bureaucrat level on mai.wikimedia (T168782) [14:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:34] T168782: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782 [14:39:42] 10Operations, 10netops, 10Patch-For-Review: deploy diffscan2 - https://phabricator.wikimedia.org/T169624#3443854 (10ayounsi) @MoritzMuehlenhoff For your first item, I tested the following locally and it seems to work fine: https://github.com/XioNoX/diffscan2/compare/master...XioNoX:quiet Per IRC conversatio... 
[14:39:46] Yep, it is a lost downtime :( [14:40:00] zeljkof, ack [14:40:14] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Firewalls appear to be preventing spark executors from talking to spark driver on stat1005 - https://phabricator.wikimedia.org/T170496#3443855 (10MoritzMuehlenhoff) [14:40:38] (03PS4) 10Eevans: Configure an additional data file directory [puppet] - 10https://gerrit.wikimedia.org/r/365081 [14:41:44] Urbanecm: 365023 is at mwdebug1002, please check [14:43:08] (03PS1) 10Ema: Instrumentation fixes [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/365610 (https://phabricator.wikimedia.org/T103882) [14:43:53] It wasn't a lost downtime \o/ [14:44:11] (03PS1) 10Elukey: profile::piwik::database: puppetize Piwik's database config [puppet] - 10https://gerrit.wikimedia.org/r/365611 (https://phabricator.wikimedia.org/T164073) [14:45:26] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3443867 (10Dereckson) >>! In T168782#3437940, @Urbanecm wrote: > @Dereckson Please run createAndPromote.php because the author o... [14:45:28] zeljkof, working [14:45:35] Please deploy [14:45:39] (03PS2) 10Elukey: profile::piwik::database: puppetize Piwik's database config [puppet] - 10https://gerrit.wikimedia.org/r/365611 (https://phabricator.wikimedia.org/T164073) [14:45:41] (03PS2) 10Hashar: Introduce commit-message-validator for nice commit messages [puppet] - 10https://gerrit.wikimedia.org/r/365609 [14:46:15] !log reboot conf1001 for kernel updates [14:46:16] paravoid: https://gerrit.wikimedia.org/r/#/c/365609/ would add a commit message validator. I have no idea what kind of rules it enforces though [14:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:44] Urbanecm: ok, deploying [14:47:01] zeljkof, thx [14:47:32] Urbanecm: just checking, should we continue? 
[14:47:38] there are a couple of commits left [14:47:57] !log zfilipin@tin Synchronized static/images/: SWAT: [[gerrit:365023|Run optipng -o7 at all PNGs (T170569)]] (duration: 00m 47s) [14:47:57] zeljkof, if you are okay with it I have no problem. [14:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:08] T170569: Run optipng at all PNGs in operations/mediawiki-config - https://phabricator.wikimedia.org/T170569 [14:48:17] Urbanecm: then, forward [14:48:18] (03CR) 10Hashar: "The commit-message-validator has been introduced for VisualEditor iirc. The source code is in integration/commit-message-validator and the" [puppet] - 10https://gerrit.wikimedia.org/r/365609 (owner: 10Hashar) [14:49:23] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [14:49:37] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [14:49:43] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:49:57] <_joe_> elukey: ^^ [14:50:00] <_joe_> what's up? [14:50:06] <_joe_> did you do something? [14:50:14] not on conf2002 [14:50:21] <_joe_> where [14:50:22] I rebooted conf1001 [14:50:27] <_joe_> uhm [14:50:36] checking pybal on eqiad LVSs [14:50:53] I didn't downtime it because I thought it wasn't needed [14:50:56] <_joe_> seems like we were connected to conf1001 [14:51:00] <_joe_> I was wrong [14:51:26] argh I am stupid and I didn't check sorry [14:51:52] <_joe_> no, I told you it was conf1002 [14:51:54] <_joe_> anyways [14:51:57] <_joe_> did it reboot? [14:52:11] yup it's up [14:52:22] yep very quickly [14:52:23] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active [14:52:26] Urbanecm: can you test 365288 at mwdebug? 
[14:52:29] <_joe_> ok [14:52:37] zeljkof, yes [14:52:38] <_joe_> replica is back up [14:52:38] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.074 second response time [14:52:53] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [14:53:02] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365288 (https://phabricator.wikimedia.org/T169438) (owner: 10Urbanecm) [14:55:16] Urbanecm: CI looks busy, it might be a while (merge)... [14:55:21] (03CR) 10jerkins-bot: [V: 04-1] profile::piwik::database: puppetize Piwik's database config [puppet] - 10https://gerrit.wikimedia.org/r/365611 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [14:55:22] _joe_: so now lvs1003/1006 (low-traffic, updated to pybal 1.13.9) both have established connections to conf1001 [14:55:25] zeljkof, ack [14:55:26] all others don't [14:55:47] <_joe_> ema: you mean the others are failing? [14:56:24] !log Deploy, manually, alter tables on enwiki on db1047 - T166204 [14:56:27] elukey: ^ [14:56:34] _joe_: yes [14:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:37] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [14:56:47] <_joe_> ema: so our fix works? [14:57:04] !log restart pybal on lvs100[45] T169765 [14:57:08] _joe_: yes :( [14:57:10] gh [14:57:10] :) [14:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:16] <_joe_> cool [14:57:16] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765 [14:57:17] nice :) [14:58:20] _joe_ I thought that there was a mirror maker running on conf1002 but can't find it, so is it only running on conf2002 (polling from eqiad only) ? 
[14:58:57] !log restart pybal on lvs100[12] T169765 [14:58:59] (03CR) 10Andrew Bogott: [C: 032] "I don't immediately need this, but it certainly won't hurt." [puppet] - 10https://gerrit.wikimedia.org/r/365171 (owner: 10Dzahn) [14:59:07] (03PS2) 10Andrew Bogott: add mapped IPv6 address for labtestervices200x [puppet] - 10https://gerrit.wikimedia.org/r/365171 (owner: 10Dzahn) [14:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:09] Urbanecm: looks like there is a merge conflict for 365288 [14:59:22] (03CR) 10Ottomata: "I MADE THE ENSURE PARAM!?" [puppet] - 10https://gerrit.wikimedia.org/r/365120 (owner: 10Ottomata) [14:59:26] (03PS2) 10Urbanecm: Allow uploads to autoconfirmed-only at huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365288 (https://phabricator.wikimedia.org/T169438) [14:59:42] zeljkof, rebase was enough. [14:59:46] (03CR) 10Ottomata: "Ah phew, ok, when I made it it worked! Folks just didn't keep it up :p" [puppet] - 10https://gerrit.wikimedia.org/r/365120 (owner: 10Ottomata) [15:00:29] (03CR) 10Zfilipin: Allow uploads to autoconfirmed-only at huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365288 (https://phabricator.wikimedia.org/T169438) (owner: 10Urbanecm) [15:00:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365288 (https://phabricator.wikimedia.org/T169438) (owner: 10Urbanecm) [15:01:54] PROBLEM - Check systemd state on conf1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:02:06] checking --^ [15:02:17] probably needs a reset-failed for mirrormaker [15:02:33] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0] [15:02:38] (03Merged) 10jenkins-bot: Allow uploads to autoconfirmed-only at huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365288 (https://phabricator.wikimedia.org/T169438) (owner: 10Urbanecm) [15:02:50] (03CR) 10jenkins-bot: Allow uploads to autoconfirmed-only at huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365288 (https://phabricator.wikimedia.org/T169438) (owner: 10Urbanecm) [15:03:03] !log uploaded Linux 4.9.30-2+deb9u2 backport to jessie-wikimedia [15:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:06] (03CR) 10Alexandros Kosiaris: [C: 031] Introduce commit-message-validator for nice commit messages [puppet] - 10https://gerrit.wikimedia.org/r/365609 (owner: 10Hashar) [15:04:09] 10Operations, 10Ops-Access-Requests: Requesting access to recommendation-api for nschaaf - https://phabricator.wikimedia.org/T170592#3436068 (10RobH) Having the ability to restart the api service is a sudo right, and thus we have to review this request in an operations meeting. Additionally, the following wil... 
[15:04:33] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2026157 [15:05:37] Urbanecm: 365288 is at mwdebug [15:05:54] RECOVERY - Check systemd state on conf1001 is OK: OK - running: The system is fully operational [15:06:12] zeljkof, working, please deploy [15:06:27] Urbanecm: ok [15:06:34] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0] [15:07:35] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:365288|Allow uploads to autoconfirmed-only at huwiki (T169438)]] (duration: 00m 47s) [15:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:49] T169438: Disable file upload for new users on huWiki - https://phabricator.wikimedia.org/T169438 [15:08:04] Urbanecm: deployed, please check [15:08:23] zeljkof, working [15:08:33] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0] [15:08:45] Urbanecm: great, and the final one, 361066... 
[15:09:28] zeljkof, there is a script to be run (syntax in comments) FYI [15:10:06] Urbanecm: saw it, thanks for the instructions [15:10:17] (PS3) Elukey: profile::piwik::database: puppetize Piwik's database config [puppet] - https://gerrit.wikimedia.org/r/365611 (https://phabricator.wikimedia.org/T164073) [15:10:33] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 41.38% of data above the critical threshold [1800.0] [15:10:48] (PS1) Ottomata: logrotate reportupdater logs as proper user/group [puppet] - https://gerrit.wikimedia.org/r/365614 (https://phabricator.wikimedia.org/T152712) [15:12:28] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) (owner: Strainu) [15:12:51] (PS7) Zfilipin: Set collation for Romanian wikis to uca-ro-u-kn [mediawiki-config] - https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) (owner: Strainu) [15:13:02] (CR) Zfilipin: Set collation for Romanian wikis to uca-ro-u-kn [mediawiki-config] - https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) (owner: Strainu) [15:13:11] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) (owner: Strainu) [15:14:45] (Merged) jenkins-bot: Set collation for Romanian wikis to uca-ro-u-kn [mediawiki-config] - https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) (owner: Strainu) [15:16:08] (CR) jenkins-bot: Set collation for Romanian wikis to uca-ro-u-kn [mediawiki-config] - https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) (owner: Strainu) [15:18:33] Urbanecm: can you test 361066 at mwdebug? 
[15:18:35] (03PS4) 10Ema: pybal: bind instrumentation TCP port to private addresses [puppet] - 10https://gerrit.wikimedia.org/r/348074 (https://phabricator.wikimedia.org/T103882) [15:18:47] zeljkof, I can't test it at all. [15:19:07] Urbanecm: ok, then deploying [15:21:14] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:361066|Set collation for Romanian wikis to uca-ro-u-kn (T168711)]] (duration: 00m 47s) [15:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:25] T168711: Changing the alphabetical sorting (collation) @ ro.wikipedia.org - https://phabricator.wikimedia.org/T168711 [15:21:34] Urbanecm: deployed, running scripts... [15:21:44] zeljkof, thank you [15:22:24] (03CR) 10Ema: "> instrumentation is used in internal scripts for deployment." [puppet] - 10https://gerrit.wikimedia.org/r/348074 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [15:22:37] (03CR) 10Mobrovac: [C: 031] Tighten access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/356548 (https://phabricator.wikimedia.org/T114815) (owner: 10Muehlenhoff) [15:22:48] (03CR) 10Elukey: [C: 032] profile::piwik::database: puppetize Piwik's database config [puppet] - 10https://gerrit.wikimedia.org/r/365611 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [15:22:55] (03PS4) 10Elukey: profile::piwik::database: puppetize Piwik's database config [puppet] - 10https://gerrit.wikimedia.org/r/365611 (https://phabricator.wikimedia.org/T164073) [15:22:58] (03CR) 10Elukey: [V: 032 C: 032] profile::piwik::database: puppetize Piwik's database config [puppet] - 10https://gerrit.wikimedia.org/r/365611 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [15:23:03] 10Operations, 10ops-codfw: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3444008 (10Papaul) @Dzahn I received 1 main board can you please depool mw2201 so I can go ahead and replacement the main board? 
Thanks. [15:23:07] (CR) Alexandros Kosiaris: [C: 1] puppetmaster: Don't use storeconfig_thin [puppet] - https://gerrit.wikimedia.org/r/365605 (owner: Andrew Bogott) [15:27:51] Urbanecm: the first script is still running [15:27:59] my meetings start in a few minutes [15:28:13] I will monitor the scripts and let you know when everything finishes [15:28:25] thanks for deploying with #releng! :) [15:28:42] zeljkof, it may take a big amount of time. Sorry that I didn't tell you about that [15:28:46] Thank you for the deploys! [15:29:04] !log EU SWAT finished! (updateCollation.php still running in the background) [15:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:09] Operations, ops-codfw, fundraising-tech-ops, netops: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#3444056 (Papaul) a:Papaul>ayounsi @ayounsi The wiring is complete on all 4 devices I didn't plug the fiber from pfw3a and pfw3b to cr1 and cr2 yet. Let me... [15:32:02] !log starting table compressing at db2072 (lag is possible) [15:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:40] Operations, Pybal, Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3444099 (ema) So at first sight it looked like T169893 fixed this issue, but that's not the case. In particular, after conf1001 had been rebooted today I've noticed that both lvs1003 a... 
[15:35:53] !log restart pybal on lvs100[36] T169765 [15:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:06] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765 [15:37:05] (PS1) Urbanecm: Provide HD logos for several Wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/365618 (https://phabricator.wikimedia.org/T150618) [15:38:37] (PS1) Gilles: Send Thumbor error log to logstash [puppet] - https://gerrit.wikimedia.org/r/365619 (https://phabricator.wikimedia.org/T150734) [15:38:39] Operations, ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3444121 (Papaul) p:Triage>Normal [15:47:25] !log Stop MySQL on pc2006 - T170520 [15:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:39] T170520: pc2006 crashed - https://phabricator.wikimedia.org/T170520 [15:49:53] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused [15:50:06] Operations, Services (next), User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3444181 (Arlolra) >>! In T170548#3442747, @MoritzMuehlenhoff wrote: > @Arlolra : ruthenium has been updated to 6.11 Thanks, I started an rt test run. [15:54:39] Operations, Icinga, monitoring: Icinga loses downtime entries, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3444235 (Volans) Open>Resolved a:Volans After few days without incident seems that we can call it resolved! \o/ [16:02:37] Operations, Huggle, Wikimedia-Mailing-lists: Reset admin password for Huggle mailing list - https://phabricator.wikimedia.org/T170803#3444298 (Aklapper) [16:04:14] (CR) Jforrester: [C: 1] "Planned for tomorrow." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/360371 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [16:04:19] (03CR) 10Jforrester: [C: 031] "Planned for tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349274 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [16:05:52] 10Operations, 10Huggle, 10Wikimedia-Mailing-lists: Reset admin password for Huggle mailing list - https://phabricator.wikimedia.org/T170803#3444321 (10RobH) 05Open>03Resolved a:03RobH I've gone ahead and performed a password reset via the mailman script, so its automatically emailed the new admin passw... [16:11:56] (03PS3) 10Faidon Liambotis: Introduce commit-message-validator for nice commit messages [puppet] - 10https://gerrit.wikimedia.org/r/365609 (owner: 10Hashar) [16:12:01] (03CR) 10Faidon Liambotis: [C: 032] Introduce commit-message-validator for nice commit messages [puppet] - 10https://gerrit.wikimedia.org/r/365609 (owner: 10Hashar) [16:12:54] hasharAway: thanks for that!! (commit-message-validator) [16:12:56] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=mw2201.codfw.wmnet [16:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:34] 10Operations, 10ops-codfw: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3444410 (10Dzahn) @Papaul Ok, thanks. Done. you can go ahead. ``` mw2201.codfw.wmnet: pooled changed yes => no mw2201.codfw.wmnet: pooled changed yes => no ``` [16:16:14] (03CR) 10jenkins-bot: Introduce commit-message-validator for nice commit messages [puppet] - 10https://gerrit.wikimedia.org/r/365609 (owner: 10Hashar) [16:16:21] halfak: outstanding for ~10h CRITICAL: https://grafana.wikimedia.org/dashboard/db/ores is alerting: 5xx rate (Change prop) alert. [16:17:06] SMalyshev, dcausse: a bunch of WDQS alerts are CRITICAL [16:17:10] Thanks for the note paravoid. We're still working on getting grafana to notify right. 
[16:17:19] Hopefully this isn't a pain in the ass for you or anyone else. [16:18:13] paravoid: sorry I can't help on WDQS :( [16:18:23] ok, who can? [16:18:29] SMalyshev [16:18:39] (given gehel is on vacation and he's usually my point person :) [16:19:10] yes, wdqs is maintained by stas and gehel, I never logged into these servers :/ [16:19:14] PROBLEM - Host mw2201 is DOWN: PING CRITICAL - Packet loss = 100% [16:19:20] ok :) [16:21:41] (03PS2) 10Ottomata: logrotate reportupdater logs as proper user/group [puppet] - 10https://gerrit.wikimedia.org/r/365614 (https://phabricator.wikimedia.org/T152712) [16:21:49] (03CR) 10Ottomata: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7084/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/365614 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [16:23:06] RainbowSprinkles: lemme know when you want to put the Minerva deploy work to bed :) [16:23:58] (03PS1) 10Elukey: profile::piwik::database: relax disk flush policy to reduce iowait [puppet] - 10https://gerrit.wikimedia.org/r/365632 (https://phabricator.wikimedia.org/T164073) [16:24:39] (03PS1) 10Faidon Liambotis: aptrepo: drop files/log, use templates/log.erb [puppet] - 10https://gerrit.wikimedia.org/r/365633 [16:24:54] (03CR) 10Elukey: [V: 032 C: 032] profile::piwik::database: relax disk flush policy to reduce iowait [puppet] - 10https://gerrit.wikimedia.org/r/365632 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [16:25:20] (03PS2) 10Faidon Liambotis: aptrepo: drop files/log, use templates/log.erb [puppet] - 10https://gerrit.wikimedia.org/r/365633 [16:26:09] <_joe_> SMalyshev: we have high lag in wdqs replication on all nodes in eqiad, should we do something about that? 
[16:26:36] _joe_: yeah see above :) [16:26:38] (03CR) 10jerkins-bot: [V: 04-1] aptrepo: drop files/log, use templates/log.erb [puppet] - 10https://gerrit.wikimedia.org/r/365633 (owner: 10Faidon Liambotis) [16:27:06] <_joe_> heh, I just noticed the icinga alerts being around for more than ~ 30 mins [16:27:47] (03PS3) 10Faidon Liambotis: aptrepo: drop files/log, use templates/log.erb [puppet] - 10https://gerrit.wikimedia.org/r/365633 [16:28:12] wdqs1002 seems to be down, which maybe is the cause for high lag on the others? [16:28:15] just guessing :) [16:29:54] (03PS1) 10Ottomata: Backup eventlogging log data from stat1005 srv-log-eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/365634 (https://phabricator.wikimedia.org/T152712) [16:29:56] <_joe_> actually I think each one has its own replicator, they're independent [16:31:14] (03PS11) 10Dduvall: contint: New role for Docker based CI slave [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) [16:31:28] (03PS2) 10Ottomata: Backup eventlogging log data from stat1005 srv-log-eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/365634 (https://phabricator.wikimedia.org/T152712) [16:32:47] (03Abandoned) 10Ottomata: Backup eventlogging log data from stat1005 srv-log-eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/365634 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [16:35:52] (03PS1) 10Ottomata: Move geowiki from stat1003 to stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/365640 (https://phabricator.wikimedia.org/T152712) [16:37:16] (03PS2) 10Ottomata: Move geowiki from stat1003 to stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/365640 (https://phabricator.wikimedia.org/T152712) [16:37:35] (03Abandoned) 10Dduvall: [DO NOT MERGE] ci: Experimental k8s cluster for ci [puppet] - 10https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864) (owner: 10Dduvall) [16:38:24] 10Operations, 10Pybal, 10Traffic: Icinga check 
for pybal HTTP connections to etcd - https://phabricator.wikimedia.org/T170847#3444578 (10ema) [16:40:41] 10Operations, 10Pybal, 10Traffic, 10monitoring: Icinga check for pybal HTTP connections to etcd - https://phabricator.wikimedia.org/T170847#3444606 (10Volans) [16:41:39] <_joe_> !log trying to revive pdfrender on scb1002, the usual bug with its restarts [16:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:14] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3444618 (10mmodell) Still waiting on merge of https://gerrit.wikimedia.org/r/#/c/355869/ [16:43:40] ebernhardson: ping - need me to swat that change at 11am? [16:43:57] cc @thcipriani (https://phabricator.wikimedia.org/T170648) [16:44:03] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [16:46:33] ebernhardson: jdlrobson My current plan (which I haven't posted on all the tickets) is to backport the two changes needed during swat and then roll train forward. [16:47:20] thcipriani: can you be more specific with which two changes you mean? [16:47:30] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3218909 (10greg) Status? [16:47:46] 10Operations, 10Ops-Access-Requests: Requesting access to recommendation-api for nschaaf - https://phabricator.wikimedia.org/T170592#3444632 (10RobH) I put in the wrong group, @mobrovac confirmed in ops meeting this is actually requesting service deployer rights. No one in the meeting objected. 
[16:48:04] mobrovac: ^ please confirm i put int he right group this time, and we can triage after meeting =] [16:48:10] kk [16:49:14] jdlrobson: I guess three patches (misread): https://gerrit.wikimedia.org/r/#/c/365405/ https://gerrit.wikimedia.org/r/#/c/365406/ https://gerrit.wikimedia.org/r/#/c/365420/ [16:49:29] jdlrobson: the patches from https://phabricator.wikimedia.org/T170648 [16:49:32] ok great thcipriani :) [16:49:47] thcipriani: Can we then enable MinervaNeue skin (cc @RainbowSprinkles ) [16:49:53] i want to make sure we're good for next train [16:50:05] Operations, Ops-Access-Requests: Requesting access to recommendation-api for nschaaf - https://phabricator.wikimedia.org/T170592#3444647 (mobrovac) Right. @schana is the service owner and maintainer, so he needs to be able to both deploy the service and directly restart it on the SCB hosts. Hence, he nee... [16:50:38] robh added a comment on the phab ^ [16:51:24] mobrovac: thanks for confirmation, its appreciated. this was approved so ill patchset and merge later today [16:51:26] jdlrobson: I want RainbowSprinkles to weigh-in on that one since he was involved with the inital rollout - I just rolled back a bunch of changes so I'm not really up-to-date on the MinervaNeue skin [16:52:12] Operations, Ops-Access-Requests: Requesting access to recommendation-api for nschaaf - https://phabricator.wikimedia.org/T170592#3444684 (RobH) a:RobH Thanks! As noted, this was approved in the operations weekly meeting. I'll claim and merge a patchset later today. [16:52:51] robh: great, thnx! [16:52:57] Operations, Ops-Access-Requests: Requesting access to recommendation-api for nschaaf - https://phabricator.wikimedia.org/T170592#3444690 (RobH) p:Triage>Normal [16:53:07] i dunno how the large 'request' icon got on that task [16:53:11] but i wanna know, seems new. 
[16:53:45] thcipriani: sure just focus on wmf9 for time being :) then we're move on to the Minerva problem :) [16:54:21] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/7085/" [puppet] - 10https://gerrit.wikimedia.org/r/365640 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [16:54:55] (03CR) 10Zfilipin: "All UpdateCollation.php have finished" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) (owner: 10Strainu) [16:55:07] Urbanecm: All UpdateCollation.php have finished! [16:56:54] thcipriani: sounds reasonable to me. If caching is a problem again the graph of 'eqiad qps' on this dashboard should show a big spike: https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=42&fullscreen&orgId=1&from=now-30m&to=now [16:57:25] that only records succesfull requests, but last time it jumped from 90qps to 250-350 [16:58:07] ebernhardson: cool, I'll watch that during rollout [16:58:11] err, 'eqiad qps current' [16:59:28] zeljkof, That's great! [17:01:53] (03PS1) 10Muehlenhoff: Remove specific version annotation for nginx [puppet] - 10https://gerrit.wikimedia.org/r/365650 [17:02:13] PROBLEM - Check systemd state on notebook1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:02:14] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:15] (03CR) 10EBernhardson: "agreed this one should go first. 
will ship it in SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365406 (https://phabricator.wikimedia.org/T170648) (owner: 10EBernhardson) [17:13:53] PROBLEM - DPKG on notebook1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:14:38] I've just adjusted ORES alterting from Grafana so that it should calm down faster (essentially, null errors == zero errors) [17:16:52] (03PS2) 10Dzahn: cache::misc: remove unused director netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/365435 [17:21:23] (03PS1) 10RobH: adding nschaaf to deploy-service and recommendation-admin groups [puppet] - 10https://gerrit.wikimedia.org/r/365658 [17:22:20] (03CR) 10RobH: [C: 032] adding nschaaf to deploy-service and recommendation-admin groups [puppet] - 10https://gerrit.wikimedia.org/r/365658 (owner: 10RobH) [17:22:32] (03CR) 10Dzahn: [C: 032] cache::misc: remove unused director netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/365435 (owner: 10Dzahn) [17:23:07] (03PS2) 10RobH: adding nschaaf to deploy-service and recommendation-admin groups [puppet] - 10https://gerrit.wikimedia.org/r/365658 [17:24:39] (03CR) 10jerkins-bot: [V: 04-1] adding nschaaf to deploy-service and recommendation-admin groups [puppet] - 10https://gerrit.wikimedia.org/r/365658 (owner: 10RobH) [17:25:27] wtf [17:26:23] (03PS1) 10Jdlrobson: Log all events for page previews in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365659 [17:26:30] my commit message is legit [17:26:34] i guess too long in title... [17:26:41] this was new to me too 17:24:36 ERROR: commit-message: commands failed [17:27:10] (03PS2) 10Jdlrobson: Log all events for page previews in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365659 [17:27:16] yeah, i guess the subject is too long? 
[17:27:18] (03PS3) 10RobH: adding nschaaf to two sudo groups [puppet] - 10https://gerrit.wikimedia.org/r/365658 [17:27:22] ive amended and will see [17:27:42] oh, look [17:27:45] 17:24:17 The following errors were found: [17:27:45] 17:24:17 Line 7: Expected one space after 'Bug:' [17:27:45] if it passes, then that is likely it cuz the rest matches the required info [17:27:49] oh? [17:28:00] after the Bug: before the task? [17:28:04] like Bug: Task [17:28:05] yea [17:28:08] or Bug:task_ [17:28:09] ok [17:28:15] Bug: T1234 [17:28:15] T1234: Restrict Bugzilla access to read-only - https://phabricator.wikimedia.org/T1234 [17:28:29] that is stupid failure =P [17:28:33] fixing [17:28:33] (03PS4) 10RobH: adding nschaaf to two sudo groups [puppet] - 10https://gerrit.wikimedia.org/r/365658 (https://phabricator.wikimedia.org/T170592) [17:29:10] just checking that we're good to merge beta cluster changes outside of the swat window, right? [17:29:33] providing that we update the deployment server as well [17:29:42] (03PS5) 10RobH: adding nschaaf to two sudo groups [puppet] - 10https://gerrit.wikimedia.org/r/365658 (https://phabricator.wikimedia.org/T170592) [17:30:04] haha, my damn repo is out of sync, so thats 5 patches for a commit message, i am highly amused ^_^ [17:31:31] (03CR) 10Faidon Liambotis: [C: 032] aptrepo: drop files/log, use templates/log.erb [puppet] - 10https://gerrit.wikimedia.org/r/365633 (owner: 10Faidon Liambotis) [17:31:40] (03PS4) 10Faidon Liambotis: aptrepo: drop files/log, use templates/log.erb [puppet] - 10https://gerrit.wikimedia.org/r/365633 [17:32:00] right, merging a beta-cluster only config change and updating the deployment host (iirc this is fine) [17:32:14] (03CR) 10Faidon Liambotis: [V: 032 C: 032] aptrepo: drop files/log, use templates/log.erb [puppet] - 10https://gerrit.wikimedia.org/r/365633 (owner: 10Faidon Liambotis) [17:32:32] (03CR) 10Ottomata: [C: 032] Move geowiki from stat1003 to stat1006 [puppet] - 
10https://gerrit.wikimedia.org/r/365640 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [17:32:34] (03PS5) 10Faidon Liambotis: aptrepo: drop files/log, use templates/log.erb [puppet] - 10https://gerrit.wikimedia.org/r/365633 [17:32:38] (03CR) 10Faidon Liambotis: [V: 032 C: 032] aptrepo: drop files/log, use templates/log.erb [puppet] - 10https://gerrit.wikimedia.org/r/365633 (owner: 10Faidon Liambotis) [17:32:45] (03PS3) 10Ottomata: Move geowiki from stat1003 to stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/365640 (https://phabricator.wikimedia.org/T152712) [17:32:45] (03CR) 10Ottomata: [V: 032 C: 032] Move geowiki from stat1003 to stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/365640 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [17:32:58] (03PS4) 10Ottomata: Move geowiki from stat1003 to stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/365640 (https://phabricator.wikimedia.org/T152712) [17:33:02] (03CR) 10Ottomata: [V: 032 C: 032] Move geowiki from stat1003 to stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/365640 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [17:33:12] (03CR) 10Phuedx: [C: 032] Log all events for page previews in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365659 (owner: 10Jdlrobson) [17:33:21] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to recommendation-api for nschaaf - https://phabricator.wikimedia.org/T170592#3444838 (10RobH) 05Open>03Resolved a:05RobH>03None This has been merged live, and all affected hosts will get the update within 30 minutes or so whe... [17:33:23] paravoid: merging yours too, ya? 
[17:33:29] ottomata: sure [17:35:30] mobrovac: restbase2001 has critical alerts [17:35:53] (03Merged) 10jenkins-bot: Log all events for page previews in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365659 (owner: 10Jdlrobson) [17:36:13] paravoid: yup, on it [17:36:31] (03CR) 10jenkins-bot: Log all events for page previews in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365659 (owner: 10Jdlrobson) [17:37:05] done [17:37:14] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_clone_geowiki-data-private] [17:38:03] 10Operations, 10wikitech.wikimedia.org: Update mediawiki on wikitech-static - https://phabricator.wikimedia.org/T170854#3444900 (10Andrew) [17:38:06] !log mobrovac@tin Started deploy [restbase/deploy@f5ca520]: Bringing restbase2001 up to date [17:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:25] !log mobrovac@tin Finished deploy [restbase/deploy@f5ca520]: Bringing restbase2001 up to date (duration: 01m 21s) [17:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:03] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15600 bytes in 0.133 second response time [17:41:23] stat1006 ^^^ is me, working on it [17:43:14] SMalyshev: ping again? [17:43:47] paravoid: yes? 
[17:43:58] SMalyshev: see above, a bunch of wdqs alerts [17:44:03] (03PS1) 10Addshore: Revert "Remove time command from statistics::wmde crons" [puppet] - 10https://gerrit.wikimedia.org/r/365662 [17:44:14] (03PS2) 10Addshore: Revert "Remove time command from statistics::wmde crons" [puppet] - 10https://gerrit.wikimedia.org/r/365662 [17:44:23] SMalyshev: wdqs1002 throws 503 for HTTP/SPARQL and wdqs1001-1003 all show as "high lag" [17:44:30] paravoid: yes, I see, somebody is spamming the service again :( [17:44:41] I wonder do we have means to do ip blocks on varnish? [17:44:50] 1002 is in maintenance, it's ok [17:45:12] the alert is unacknowledged [17:45:24] I am not sure what that means :) [17:45:45] it means that it's alerting :) [17:46:18] I can set a downtime for a period, what kind of period would that be? [17:46:19] yeah I know, it's ok, it's in maintenance. Is there a way to stop it? [17:46:26] I did set the downtime afaik [17:46:50] In Scheduled Downtime? [17:46:50] NO [17:46:53] nope [17:47:09] it's fine, I'll do it now [17:47:13] when should it expire? [17:48:14] hmm maybe I forgot for this one. I set it again now, does it work? [17:48:34] yes :) [17:49:22] (03CR) 10Ottomata: [C: 032] Revert "Remove time command from statistics::wmde crons" [puppet] - 10https://gerrit.wikimedia.org/r/365662 (owner: 10Addshore) [17:49:33] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [17:50:02] hm systemd-udev-settle.service failed on rb2001 [17:50:09] fixed it [17:50:15] oh? [17:50:23] dunno why, I just started it [17:50:23] indeed! [17:50:25] thnx paravoid! [17:50:41] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365663 (https://phabricator.wikimedia.org/T170844) [17:50:42] if it happens again, let's investigate more [17:50:45] paravoid: question about blocking by ip still remains though. Is it possible? 
[17:50:55] it's possible but very adhoc [17:51:02] and it's generally a game of whack-a-mole too [17:52:09] the traffic team has been working on rate limits to make this more generic [17:52:13] cf. https://phabricator.wikimedia.org/T163233 [17:52:46] well, I have rate limits but it's too high for this particular spam [17:53:31] should be fixed at the app layer really though [17:53:43] paravoid: not sure what you mean? [17:53:44] varnish can't know how expensive a query is etc. [17:53:54] the WDQS layer, in this case :) [17:53:56] paravoid: I know how expensive it is [17:54:16] I just want to block that source so I don't have to deal with that spam [17:54:19] yeah, the varnish layer doesn't, so it can't make those decisions to block it [17:54:36] I already made the decision :) [17:54:37] blocking IPs doesn't scale (nor works in many cases) [17:55:00] I'm talking about software and you're talking about a human decision [17:55:03] I don't really need many cases... [17:55:08] (03PS1) 10Ottomata: Use https gerrit url to clone geowiki data-public on stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/365666 (https://phabricator.wikimedia.org/T152712) [17:55:27] we can't be examining requests by hand on a case by case basis, that's not really how we do things :) [17:56:34] paravoid: ok, so how you do things? I have somebody who is spamming the service with heavy queries that bring it down [17:56:55] fix your service so that it rate-limits (or blocks) abusive/expensive queries [17:57:18] paravoid: it is impossible, there's no way to know upfront how expensive the query is [17:57:25] and by the time it's done, it's too late [17:57:35] the damage is already done [17:58:06] kill queries that run > N seconds maybe? [17:58:11] I really have no insight on how WDQS works [17:59:08] paravoid: I do :) that's why I say it's impossible to know the query is too expensive until it's done. It already kills queries longer than 60 secs. 
The problem is it's too late for those particular ones [17:59:11] but we can't be really having people investigate queries and find an IP and push a commit to block it everytime some random person on the internet decides to run such a query [17:59:21] (or having these conversations) [17:59:39] (03CR) 10Ottomata: [C: 032] Use https gerrit url to clone geowiki data-public on stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/365666 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170717T1800). Please do the needful. [18:00:05] RoanKattouw and ebernhardson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:15] o/ [18:00:21] I'll SWAT! [18:00:24] Yay [18:01:19] blocking a particular IP is addressing the symptom, not the root cause [18:01:27] paravoid: ok, so there's no way to fix this problem that you could suggest to me. Thanks, I'll look elsewhere [18:01:45] Niharika: welcome to SWAT team by the way :) [18:01:46] post-swat I'll follow up with wmf.9 train [18:02:06] well, my suggestion is to fix the problem, rather than "fixing" what triggered it [18:02:07] Dereckson: Thank you! It's fun. :) [18:02:39] Niharika: same what Dereckson said glad to have you running SWATs :) [18:02:43] Oooh welcome [18:03:02] I'm afraid we can't assist in helping with playing whack-a-mole with abusers, no :) [18:03:02] 10Operations, 10ops-codfw, 10Parsoid, 10Patch-For-Review, 10Services (watching): wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3445034 (10RobH) @papaul: This was already depooled, and is now shutdown and in maint-mode in icinga for the next 24 hours. Please run the RAM test... 
[18:03:10] paravoid: you mean rewrite the whole service so that it could look into the future and know which query might be too heavy for it. And do it in the next couple of hours or so so we can bring the service back up. I don't think that solves the problem, teally [18:03:13] *really [18:03:13] papaul: ^ so that is all ready for you to run ram test, it should be powered down [18:03:45] :) [18:03:49] uh [18:03:50] ok, whatever [18:04:41] Why are our Group 2 wikis not up-to-date? http://tools.wmflabs.org/versions/ Did I miss an email? [18:06:07] !log mobrovac@tin Started deploy [changeprop/deploy@f80c333]: (no justification provided) [18:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:22] Niharika: yeah, the new numbering scheme [18:06:25] see engineering@ [18:06:40] (03CR) 10Dzahn: "any concerns about adding that?" [puppet] - 10https://gerrit.wikimedia.org/r/355869 (https://phabricator.wikimedia.org/T164810) (owner: 10Dzahn) [18:06:45] Niharika: tl;dr: I will get group2 wikis on wmf.9 after I merge patches for https://phabricator.wikimedia.org/T170648 [18:07:07] oh maybe I misunderstood, sorry [18:07:13] thcipriani is obviously authoritative :) [18:07:21] Ahh, okay. Thanks thcipriani and paravoid. 
:) [18:07:24] !log mobrovac@tin Finished deploy [changeprop/deploy@f80c333]: (no justification provided) (duration: 01m 17s) [18:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:51] :) [18:09:14] (03PS2) 10Niharika29: Configure CirrusSearch-MoreLike pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365406 (https://phabricator.wikimedia.org/T170648) (owner: 10EBernhardson) [18:09:25] (03CR) 10Niharika29: [C: 032] Configure CirrusSearch-MoreLike pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365406 (https://phabricator.wikimedia.org/T170648) (owner: 10EBernhardson) [18:10:10] Niharika: nothing to test here really, the code that uses this config hasn't shipped yet. Just preparing [18:10:28] ebernhardson: Okay, so just sync? [18:10:31] yup [18:10:48] 10Operations, 10ops-codfw, 10Parsoid, 10Patch-For-Review, 10Services (watching): wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3445066 (10RobH) p:05Normal>03High [18:10:55] (03Merged) 10jenkins-bot: Configure CirrusSearch-MoreLike pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365406 (https://phabricator.wikimedia.org/T170648) (owner: 10EBernhardson) [18:11:08] (03CR) 10jenkins-bot: Configure CirrusSearch-MoreLike pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365406 (https://phabricator.wikimedia.org/T170648) (owner: 10EBernhardson) [18:11:49] (03PS1) 10Ottomata: Set up rsync module for /home on stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/365668 (https://phabricator.wikimedia.org/T152712) [18:12:01] 10Operations, 10MW-1.30-release-notes, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3445069 (10Benoit_Rochon) @Dcljr : all interwikis are working fine. Regarding sysop actions, it's the Pow-wow season an... 
[18:12:55] !log mobrovac@tin Started deploy [restbase/deploy@f5ca520]: Activate dinwiki support [18:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:16] (03CR) 10Ottomata: [C: 032] Set up rsync module for /home on stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/365668 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:16:13] !log niharika29@tin Synchronized wmf-config/PoolCounterSettings.php: Configure CirrusSearch-MoreLike pool counter [mediawiki-config] - https://gerrit.wikimedia.org/r/365406 (T170648) (duration: 02m 54s) [18:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:25] T170648: CirrusSearch/includes/Elastica/PooledHttp.php:67 Timeout reached waiting for an available pooled curl connection! - https://phabricator.wikimedia.org/T170648 [18:17:03] (03PS1) 10Ottomata: Remove public data geowiki push [puppet] - 10https://gerrit.wikimedia.org/r/365670 (https://phabricator.wikimedia.org/T152712) [18:17:31] 10Operations, 10ops-codfw: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3445099 (10Papaul) @Dzahn main board replacement on mw2201 complete.Please test and let me know. thanks [18:17:31] thcipriani: mw2201.codfw had a connection timeout. Do you know what's up with that? [18:17:39] I can retry a sync... [18:17:51] 10Operations, 10Services (next), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3445100 (10Arlolra) [18:18:13] Niharika: I do not know what's up with that... [18:18:32] ah, it's in the SAL [18:18:34] 16:12 dzahn@neodymium: conftool action : set/pooled=no; selector: name=mw2201.codfw.wmnet [18:18:50] that's because the hardware is being replaced today [18:19:00] So I don't need to worry about that, right? [18:19:06] gotcha, for some reason still in the dsh files for scap :( [18:19:12] no, i hope it doesnt block you, dont worry [18:19:18] Okay, thanks! 
[18:19:26] (03CR) 10Ottomata: [C: 032] Remove public data geowiki push [puppet] - 10https://gerrit.wikimedia.org/r/365670 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:19:56] is anyone else getting "UNRECOVERABLE FATAL ERROR" in phab? [18:20:33] !log mobrovac@tin Finished deploy [restbase/deploy@f5ca520]: Activate dinwiki support (duration: 07m 39s) [18:20:40] Zackary hi, where are you getting that error [18:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:46] thcipriani: i'd remove it but i literally just got the message that it should be fixed.. on it [18:20:53] paladox: https://phabricator.wikimedia.org/maniphest/report/ [18:21:02] thanks [18:21:23] it's timeing out [18:21:24] which is known [18:21:34] oh [18:21:37] also, i wanted it to still be in sync with deploy [18:22:05] Sometimes it will work and other times it wont [18:22:06] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Need ability to block specific query sources for WDQS - https://phabricator.wikimedia.org/T170860#3445111 (10Smalyshev) [18:22:33] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Need ability to block specific query sources for WDQS - https://phabricator.wikimedia.org/T170860#3445124 (10Smalyshev) p:05Triage>03High [18:22:45] RoanKattouw: Your patch is on mwdebug1002. Please test. :) [18:24:44] paladox: ah, thanks for the help~ [18:24:51] (03PS1) 10Ottomata: Remove dependency on geowiki public data job [puppet] - 10https://gerrit.wikimedia.org/r/365672 (https://phabricator.wikimedia.org/T152712) [18:24:53] your welcome :) [18:25:47] Oh hah enwiki isn't on wmf9 yet [18:25:53] Zackary https://phabricator.wikimedia.org/T125357 [18:26:30] (03PS1) 10Dzahn: install_server: update MAC address of mw2201 in DHCP [puppet] - 10https://gerrit.wikimedia.org/r/365673 (https://phabricator.wikimedia.org/T170307) [18:26:48] RoanKattouw: Yeah. I see Live update on enwikisource. 
Looks cool. \o/ [18:27:02] (03CR) 10Ottomata: [C: 032] Remove dependency on geowiki public data job [puppet] - 10https://gerrit.wikimedia.org/r/365672 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:27:05] " only works in the exceedingly rare event such as when the moon is full while simultaneously there is unusually little activity" :)) [18:27:31] but full moon means people don't sleep, which means more activity [18:27:41] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 10Services (done): secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#3445139 (10GWicke) In terms of the concrete JMX protection question, the main proposals are: - Set up JMX authentication, with credentials stored a... [18:28:35] (03PS2) 10Dzahn: install_server: update MAC address of mw2201 in DHCP [puppet] - 10https://gerrit.wikimedia.org/r/365673 (https://phabricator.wikimedia.org/T170307) [18:28:50] Niharika: Seems to work. I haven't found a high-traffic enough wmf7 wiki to properly test on yet though [18:28:54] Oh, I know, Wikidata [18:29:29] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3445147 (10RobH) a:03Dzahn chatted with Daniel and he is handling the follow up on this task [18:29:32] I see new updates on enwikisource. [18:30:53] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:30:55] Working quite well on Wikidat [18:31:00] Looks good to sync [18:31:30] RoanKattouw: Why do we have a link saying "Show new changes starting from 18:31, 17 July 2017" with that timer counting up? Was it there before this feature? [18:31:33] Ack. Syncing. 
[18:32:25] The feature is very much quick & dirty right now [18:32:30] There's a reason it's hidden behind a query parameter :) [18:32:44] The implementation isn't great either [18:33:53] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:34:08] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3445226 (10RobH) This should be handled via self dispatch, not by contacting Dell. @Papaul: Please see my email on self dispatch, and get that handled before processing this request. [18:35:46] Yes that link was there before, it's the most obscure and unknown feature on RC [18:36:03] (03CR) 10Dzahn: [C: 032] install_server: update MAC address of mw2201 in DHCP [puppet] - 10https://gerrit.wikimedia.org/r/365673 (https://phabricator.wikimedia.org/T170307) (owner: 10Dzahn) [18:36:17] I used it a lot circa 2006 [18:36:34] But it's completely unintuitive, so in the new beta Pau's got a better design for it [18:36:42] !log niharika29@tin Synchronized php-1.30.0-wmf.9/resources/src/mediawiki.rcfilters/: RCFilters: Allow experimental live update feature to be enabled with query string parameter https://gerrit.wikimedia.org/r/#/c/365413/ (duration: 02m 51s) [18:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:57] RoanKattouw: Synced. Nice! [18:37:07] T163426 [18:37:07] T163426: Add 'View Newest Changes' option into the Integrated Filters - https://phabricator.wikimedia.org/T163426 [18:39:55] RoanKattouw: The mockups look cool. Much better than right now. [18:39:59] ~~~~~~~~~~~~SWAT over~~~~~~~~~~~~~ [18:40:11] neat [18:40:16] * thcipriani does train stuff [18:40:59] ebernhardson: I saw one of your patches for wmf.9 went with swat, did you need this one backported as well: https://gerrit.wikimedia.org/r/#/c/365405/ ? 
[18:41:44] jdlrobson: looking at the failure you noted here https://gerrit.wikimedia.org/r/#/c/365647/1 the output in jenkins and the phab ticket you linked to look like different issues [18:41:57] thcipriani: looking maybe wrong phab task [18:41:59] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [18:42:18] thcipriani: nope same issue [18:42:26] both tests belong to Minerva [18:42:50] change has nothing to do with Minerva [18:44:11] yeah, my guess is it's being triggered because https://github.com/wikimedia/integration-config/blob/master/zuul/parameter_functions.py#L186-L187 [18:44:14] !log rebooting mw2201 for MAC address change [18:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:48] jdlrobson: should that dependency be removed or can we fix the test in that dependency? Force merging will stall everything because zuul's world will have changed so I'd like to avoid that if possible... [18:47:12] thcipriani: i havent worked out how to fix that yet ... hence open bug [18:47:12] thcipriani: hasn't been merged yet, but should be safe to. That will basically start rejecting requests if the varnish caching rate drops dramatically [18:47:18] you can remove Minerva from the CI config [18:47:27] the Minerva PHPUnit job [18:48:13] (or maybe even drop requests earlier than that, not sure if i chose the right value for the initial pool counter size :P) [18:48:31] (03CR) 10Eevans: [C: 031] "To follow-up, this is limited to a single host in the dev environment (non-production)" [puppet] - 10https://gerrit.wikimedia.org/r/365081 (owner: 10Eevans) [18:50:41] thcipriani: more here: https://phabricator.wikimedia.org/T170624#3440788 [18:52:15] jdlrobson: fine to remove dependency for now? 
https://gerrit.wikimedia.org/r/#/c/365678/ [18:52:43] thcipriani: you'd need to remove it from RelatedArticles [18:52:55] thcipriani: aggh [18:52:59] actually that will create more isses [18:53:52] thcipriani: lemme think a bit. So goal is to get Jenkins to pass? [18:54:39] on wmf9? [18:54:55] jdlrobson: goal is to be able to backport your change to related article on wmf.9 without disrupting ci [18:55:01] *releatedarticles [18:55:32] https://gerrit.wikimedia.org/r/365679 Remove tests/phpunit/MenuBuilderTest.php < thcipriani [18:55:34] might do it [18:55:37] with the ultimate goal of getting wmf.9 out :) [18:57:04] jdlrobson: neat, yeah that matches up with the error I was seeing: "16:47:38 Fatal error: Class already declared: Tests\MediaWiki\Minerva\MenuTest" [18:57:12] thcipriani: but hoping it doenst show more errors.. [18:57:21] ie. the ones in https://phabricator.wikimedia.org/T170624#3440788 [18:57:40] ill start looking at how to fix https://phabricator.wikimedia.org/T170624 just in case [19:01:23] jdlrobson: now we have Fatal error: Class already declared: TestSkinMinerva :( [19:01:33] dammit a mosquito decided it's dinner time too [19:01:46] thcipriani: ah lemme delete them all.. 1s [19:02:18] oops, ignore please [19:16:09] ebernhardson: is there anything to check with https://gerrit.wikimedia.org/r/#/c/365677/ ? pulled over on mwdebug1002 currently [19:16:13] (03PS1) 10Ottomata: Move reportupdater jobs from stat1003 -> stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/365684 (https://phabricator.wikimedia.org/T152712) [19:17:02] (03PS2) 10Ottomata: Move reportupdater jobs from stat1003 -> stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/365684 (https://phabricator.wikimedia.org/T152712) [19:17:32] thcipriani: not really, the question will be if the requests start filling the pool counter queue once shipped [19:18:00] i have a graph for that too. 
Sadly i don't have a graph for queue size, only for things that get dropped (queue size isn't exposed in php) [19:18:02] 10Operations, 10Recommendation-API, 10Service-deployment-requests, 10Services (done), 10User-mobrovac: New Service Request: recommendation-api - https://phabricator.wikimedia.org/T167664#3445668 (10mobrovac) 05Open>03Resolved a:03mobrovac The service is alive and kicking on SCB. Resolving. [19:19:19] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7087/" [puppet] - 10https://gerrit.wikimedia.org/r/365684 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [19:20:24] 10Operations, 10ops-codfw, 10Parsoid, 10Patch-For-Review, 10Services (watching): wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3445686 (10Papaul) First test complete. Running the second test {F8789623} [19:20:38] we usually peak at 125qps with a p50 of <100ms, and p95 of 325ms ... so i think 50 concurrent should be fine though [19:21:12] alright, going live with that change [19:22:38] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3445689 (10Papaul) Test complete with error 2002-0251 {F8789629} [19:26:18] !log thcipriani@tin Synchronized php-1.30.0-wmf.9/extensions/CirrusSearch: [[gerrit:365677|Add PoolCounter specifically for morelike]] T170648 (duration: 03m 02s) [19:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:31] T170648: CirrusSearch/includes/Elastica/PooledHttp.php:67 Timeout reached waiting for an available pooled curl connection! 
- https://phabricator.wikimedia.org/T170648 [19:30:14] thcipriani: all looks sane enough so far [19:35:23] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3445766 (10Dzahn) @greg I blocked him by saying we need to test the uploading (lots of history with having to deb... [19:36:52] ebernhardson thcipriani: just gonna grab some lunch. Need me? I'll ping you and RainbowSprinkles after lunch about the Minerva deploy... I'd really like to get that resolved since it's causing a lot of these issues. [19:37:01] i'd hate to block the next train too [19:37:07] k [19:37:27] also k [19:47:41] (03CR) 10Mobrovac: [C: 031] "Actually, should we pick one of restbase-dev100[456] instead in light of T166181 being tackled this week?" [puppet] - 10https://gerrit.wikimedia.org/r/365081 (owner: 10Eevans) [20:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170717T2000). Please do the needful. [20:00:12] No ORES today [20:00:15] Likely tomorrow :) [20:01:17] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [20:09:38] Hi! I need a sysadmin to overlook a renaming requested by User:Moros, who has ~170.000 contribs. 
[20:09:57] request is here: https://de.wikipedia.org/wiki/Wikipedia:Benutzernamen_%C3%A4ndern [20:10:47] no parsoid deploy today [20:11:37] (03PS1) 10Jdlrobson: Enable page previews for everyone on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365696 (https://phabricator.wikimedia.org/T162672) [20:14:12] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Need ability to block specific query sources for WDQS - https://phabricator.wikimedia.org/T170860#3445856 (10Smalyshev) Some [[ https://pivot.wikimedia.org/#webrequest/table/2/EQUQLgxg9AqgKgYWAGgN7APYAdgC5gQAWAhgJYB2KwApgB5YBO1Az... [20:14:30] (03CR) 10Eevans: [C: 031] "> Actually, should we pick one of restbase-dev100[456] instead in" [puppet] - 10https://gerrit.wikimedia.org/r/365081 (owner: 10Eevans) [20:17:26] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3445865 (10RobH) Please note that memory test output doesn't denote which memory dimm reported failed in the memory test. That information is required for that test to be useful. Otherwise, the logs show a single CPU repor... [20:19:27] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [20:19:46] !log thcipriani@tin Synchronized php-1.30.0-wmf.9/extensions/RelatedArticles: [[gerrit:365647|Add limit via ResourceLoaderGetConfigVars]] (duration: 02m 38s) [20:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:21] (03PS1) 10Thcipriani: all wikis to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365702 [20:22:23] (03CR) 10Thcipriani: [C: 032] all wikis to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365702 (owner: 10Thcipriani) [20:22:57] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [20:23:29] thcipriani hey there. How's deploy going? 
[20:23:58] (03Merged) 10jenkins-bot: all wikis to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365702 (owner: 10Thcipriani) [20:24:08] jdlrobson: I got the relatedarticles change merged and out... getting ready to roll forward all wikis ^ [20:24:19] fingers are crossed for you! [20:24:42] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Need ability to block specific query sources for WDQS - https://phabricator.wikimedia.org/T170860#3445111 (10debt) Looks like the [[ http://ipinfo.io/AS20633/141.2.0.0/16-141.2.108.0/23 | IP ]] is from a University in Germany: Jo... [20:26:21] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.30.0-wmf.9 [20:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:48] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] [20:27:57] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 2.814 second response time [20:28:17] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [1800.0] [20:28:27] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [20:28:57] PROBLEM - Disk space on stat1006 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [20:31:38] ACKNOWLEDGEMENT - Host mw2201 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn mainboard replaced [20:33:30] !log mw2201 - reinstalling OS after mainboard replacement (network interfaces became eth2/eth3 from eth0/eth1 so ferm failed etc) - T170307 [20:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:41] T170307: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307 [20:33:59] (03CR) 10jenkins-bot: all wikis to 1.30.0-wmf.9 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/365702 (owner: 10Thcipriani) [20:35:07] RECOVERY - Host mw2201 is UP: PING WARNING - Packet loss = 58%, RTA = 36.28 ms [20:36:17] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0] [20:37:42] mutante: are you bringing back 2201? It was down while I was rolling forward mw branches just now, if you let me know when it's back I can make sure that its wikiversions are set correctly. (just requires scap pull && scap wikiversions-compile as mwdeploy on that machine) [20:37:45] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Need ability to block specific query sources for WDQS - https://phabricator.wikimedia.org/T170860#3445917 (10Smalyshev) Yup, no idea who that might be though. It produces 4% of all(!) queries on the service for the last week, and... [20:42:52] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Need ability to block specific query sources for WDQS - https://phabricator.wikimedia.org/T170860#3445111 (10Lydia_Pintscher) FWIW I don't know either who it could be. [20:43:42] ebernhardson: well, I pushed wmf.9 out, I don't see anything that jumps out on https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=42&fullscreen&orgId=1&from=now-15m&to=now [20:44:59] thcipriani: it is currently installing an OS while we speak, i tried to fix it without that at first, so yea and that would be great if you can sync it for me when done [20:45:14] watches the console and will updated [20:46:18] yup, just ping me whenever it's finished and I'll make sure it's where it should be wiki-wise [20:47:23] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3445965 (10RobH) I've gone ahead and used the self-dispatch tool to request a new mainboard. It seems the self-dispatch is still reviewed by their team, but we can list the exact parts and timeline for replacement. At the... 
[20:48:40] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Need ability to block specific query sources for WDQS - https://phabricator.wikimedia.org/T170860#3445111 (10EBernhardson) I don't know if it would work as well here, but other services in the cluster user a cluster-wide semaphor... [20:48:50] thank you, i also got connection issues, just got back online [20:49:44] thcipriani: yup everything looks pretty sane this time around [20:50:37] ebernhardson: nice, seems mostly sane on this end as well... [20:54:09] 10Operations, 10Services (doing), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3446017 (10GWicke) [20:55:57] PROBLEM - Host mw2201 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:28] that's the reboot after install was done [20:58:00] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Need ability to block specific query sources for WDQS - https://phabricator.wikimedia.org/T170860#3446032 (10Esc3300) Maybe @Gymel would know who. [20:59:17] RECOVERY - Host mw2201 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [21:00:04] dapatrick, bawolff, and Reedy: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170717T2100). Please do the needful. [21:00:17] RECOVERY - DPKG on notebook1002 is OK: All packages OK [21:01:40] !log mw2201 - revoke old puppet cert, salt key, accept/sign news cert and key, initial pupet run .. 
T170307 [21:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:53] T170307: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307 [21:02:07] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3446099 (10Esc3300) [21:02:51] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Need ability to block specific query sources for WDQS - https://phabricator.wikimedia.org/T170860#3446101 (10Smalyshev) @EBernhardson we have per-IP limits, and they work fine in 99% of cases. In this case, however, the query is... [21:04:47] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [21:04:54] !log mw2202 - renew puppet cert that was accidentally revoked [21:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:33] (03PS1) 10Smalyshev: Enable banning clients by IP by setting wdqs::gui::bad_clients [puppet] - 10https://gerrit.wikimedia.org/r/365821 (https://phabricator.wikimedia.org/T170860) [21:06:05] (03PS3) 10Thcipriani: CI/integration: Create role for docker CI agent [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) [21:06:07] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:07:04] (03PS4) 10Thcipriani: CI/integration: Create role for docker CI agent [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) [21:07:16] (03CR) 10jerkins-bot: [V: 04-1] Enable banning clients by IP by setting wdqs::gui::bad_clients [puppet] - 10https://gerrit.wikimedia.org/r/365821 (https://phabricator.wikimedia.org/T170860) (owner: 10Smalyshev) [21:07:43] (03Abandoned) 10Thcipriani: contint: New role for Docker based CI slave [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) (owner: 10Dduvall) [21:08:12] (03CR) 10Thcipriani: "> Overall looks good, any chance we can squash it to it's dependent" [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) (owner: 10Thcipriani) [21:09:49] waits a long time for puppet running .. hmm [21:10:20] (03PS2) 10Smalyshev: Enable banning clients by IP by setting wdqs::gui::bad_clients [puppet] - 10https://gerrit.wikimedia.org/r/365821 (https://phabricator.wikimedia.org/T170860) [21:11:43] (03PS3) 10Smalyshev: Enable banning clients by IP by setting wdqs::gui::bad_clients [puppet] - 10https://gerrit.wikimedia.org/r/365821 (https://phabricator.wikimedia.org/T170860) [21:14:38] PROBLEM - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:16:17] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [21:18:29] !log powercycling ocg1001 which went down and had no console output at all [21:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:34] eh.. 
that's dead [21:19:47] !log ocg1001 - dead - " Exception Inside the Exception Handler [21:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:38] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=ocg1001.eqiad.wmnet [21:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:05] !log ocg1001 - Type: General Protection Fault (13) Source: Software (UEFI0011) - depooled [21:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:51] 10Operations, 10OCG-General: ocg1001 is broken - https://phabricator.wikimedia.org/T170886#3446253 (10Dzahn) [21:24:53] 10Operations, 10ops-eqiad, 10OCG-General: ocg1001 is broken - https://phabricator.wikimedia.org/T170886#3446280 (10Dzahn) [21:25:17] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 2 others: Need ability to block specific query sources for WDQS - https://phabricator.wikimedia.org/T170860#3445111 (10GWicke) {T167906} might help with this problem, especially if you can reduce the allowed concurrency to a fairly small value. [21:25:31] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 2 others: Need ability to block specific query sources for WDQS - https://phabricator.wikimedia.org/T170860#3446287 (10faidon) [21:28:38] ACKNOWLEDGEMENT - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T170886 [21:29:16] (03PS1) 10Rush: localrun: add hiera_config path and show_diff [puppet] - 10https://gerrit.wikimedia.org/r/365862 [21:30:12] (03CR) 10Reedy: [C: 031] "Brandon, most of them have been looked at and improved. I think that this is the last one left. 
But there could be others" [dns] - 10https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: 10Herron) [21:31:28] (03PS4) 10Smalyshev: [WIP] Enable banning clients by IP by setting wdqs::gui::bad_clients [puppet] - 10https://gerrit.wikimedia.org/r/365821 (https://phabricator.wikimedia.org/T170860) [21:31:49] 10Operations, 10Performance-Team, 10TemplateStyles, 10Traffic, and 3 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3446323 (10Jdforrester-WMF) [21:36:08] (03CR) 10Dzahn: "stat1006 went out of disk space today,related?" [puppet] - 10https://gerrit.wikimedia.org/r/365684 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [21:39:12] (03CR) 10Reedy: [C: 031] "Oh, just the Wikipedia ones to do" [dns] - 10https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: 10Herron) [21:39:22] (03PS2) 10Krinkle: Remove wikimedia-periodic-update.sh [puppet] - 10https://gerrit.wikimedia.org/r/354596 (owner: 10Reedy) [21:40:29] (03CR) 10Rush: [C: 032] localrun: add hiera_config path and show_diff [puppet] - 10https://gerrit.wikimedia.org/r/365862 (owner: 10Rush) [21:40:31] (03PS3) 10Krinkle: Remove wikimedia-periodic-update.sh [puppet] - 10https://gerrit.wikimedia.org/r/354596 (owner: 10Reedy) [21:40:33] (03CR) 10Krinkle: [C: 031] Remove wikimedia-periodic-update.sh [puppet] - 10https://gerrit.wikimedia.org/r/354596 (owner: 10Reedy) [21:44:02] (03PS4) 10Dzahn: Remove wikimedia-periodic-update.sh [puppet] - 10https://gerrit.wikimedia.org/r/354596 (owner: 10Reedy) [21:44:33] (03CR) 10Dzahn: [C: 032] "yes, "mediawiki::maintenance::update_flaggedrev_stats does this" and no puppet class references this file name." 
[puppet] - 10https://gerrit.wikimedia.org/r/354596 (owner: 10Reedy) [21:45:05] (03PS5) 10Dzahn: mediawiki: Remove wikimedia-periodic-update.sh [puppet] - 10https://gerrit.wikimedia.org/r/354596 (owner: 10Reedy) [21:54:37] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 4068 [21:57:02] !log thcipriani@tin Started scap: [[gerrit:365861|Add missing Minerva skin description message key]] prep for MinervaNeue deployment [21:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:47] STILL waits for the same puppet run.. [22:00:51] one appserver = half a day and people from 3 teams . grmmm [22:03:37] sounds about right [22:03:42] =P [22:06:10] (03PS5) 10Dzahn: Configure an additional data file directory [puppet] - 10https://gerrit.wikimedia.org/r/365081 (owner: 10Eevans) [22:06:34] (03PS6) 10Dzahn: restbase: Configure an additional data file directory (dev) [puppet] - 10https://gerrit.wikimedia.org/r/365081 (owner: 10Eevans) [22:06:45] (03CR) 10Dzahn: [C: 032] "per " limited to a single host in the dev environment (non-production)"" [puppet] - 10https://gerrit.wikimedia.org/r/365081 (owner: 10Eevans) [22:08:07] (03CR) 10jerkins-bot: [V: 04-1] restbase: Configure an additional data file directory (dev) [puppet] - 10https://gerrit.wikimedia.org/r/365081 (owner: 10Eevans) [22:08:59] (03PS7) 10Dzahn: restbase: Configure an additional data file directory (dev) [puppet] - 10https://gerrit.wikimedia.org/r/365081 (https://phabricator.wikimedia.org/T170276) (owner: 10Eevans) [22:09:14] robh: more new checks "22:07:46 Line 3: Bug: value must be a single phabricator task ID" [22:09:29] it caught a missing T , heh [22:12:17] ACKNOWLEDGEMENT - Disk space on stat1006 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%): daniel_zahn https://phabricator.wikimedia.org/T152712#3446370 [22:13:06] mutante that's the new commit msg check hashar introduced earlier today :) [22:14:40] paladox: it works.. 
it already found things more than once, heh [22:14:47] :) [22:14:57] mutante: durn it [22:15:12] i rarely link multiple tasks into a single patchset [22:15:14] but it's nice to do so [22:15:15] oh well [22:15:25] yea, same [22:15:31] bitching that one time here is about how much energy i have for it ;D [22:16:00] !log thcipriani@tin Finished scap: [[gerrit:365861|Add missing Minerva skin description message key]] prep for MinervaNeue deployment (duration: 18m 57s) [22:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:20] ^ jdlrobson RainbowSprinkles those messages should be live [22:16:32] thcipriani: w00t [22:16:45] thcipriani: is Minerva enabled anywhere? [22:17:17] https://en.wikipedia.org/wiki/MediaWiki:Minerva-skin-desc [22:18:00] cool. can we enable Minerva now on mwdebug1001 or 1002 ? [22:18:14] jdlrobson: Minerva isn't enabled anywhere new... [22:18:30] thcipriani: sure, but it needs to be as otherwise all hell will break loose on next train [22:19:10] (i thought we had it on test wikipedia but it's not there any more) [22:19:35] I reverted everything back to the start state before I rolled wmf.9 forward. [22:19:47] this likely included enabling MinervaNeue on testwiki [22:21:56] disabling you mean? [22:22:26] RainbowSprinkles: thcipriani so feel comfortable enabling it now? I'd rather not wait till the actual train to do this. [22:23:28] what are the consequences of not enabling it now. I'm completely unfamiliar with the rollout plan you both worked on and I've never deployed a skin. [22:24:05] thcipriani: if we don't do it now, as soon as the branch is cut every mobile view will throw a RuntimeException [22:24:13] Yeah, which was our wmf7/9 issue [22:24:20] It's best to enable when only 1 version live [22:24:37] why does that happen?
[22:24:50] RainbowSprinkles: correction - that was wmf9 + https://gerrit.wikimedia.org/r/365122 [22:25:12] wmf7 was fine but had a css issue [22:25:23] wmf9 does not have https://gerrit.wikimedia.org/r/365122 [22:25:26] since we reverted it [22:25:34] wmf-next will have https://gerrit.wikimedia.org/r/365122 [22:25:38] Point being tho: it's easier to roll out to 1 version alone [22:25:49] RainbowSprinkles: as in enable it now? [22:26:05] Now, tomorrow morning, anytime before I load up the train [22:26:32] RainbowSprinkles: if we can do it now.. that would make me less nervous about the train tomorrow [22:26:50] wmf9 was fine last time before we reverted https://gerrit.wikimedia.org/r/365122 [22:28:51] (and i've got meetings all morning tomorrow :/) [22:32:07] PROBLEM - Check size of conntrack table on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:07] PROBLEM - nutcracker port on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:07] PROBLEM - Check systemd state on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:07] PROBLEM - nutcracker process on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:07] PROBLEM - configured eth on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:08] PROBLEM - MD RAID on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:17] PROBLEM - salt-minion processes on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:18] PROBLEM - dhclient process on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:27] PROBLEM - Apache HTTP on mw2201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:37] PROBLEM - Check whether ferm is active by checking the default input chain on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:37] PROBLEM - puppet last run on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:32:37] PROBLEM - Disk space on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:57] PROBLEM - HHVM processes on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:58] PROBLEM - Nginx local proxy to apache on mw2201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:58] PROBLEM - DPKG on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:58] PROBLEM - HHVM rendering on mw2201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:35:41] (03PS1) 10Rush: labtest: rabbitmq for openstack control node [puppet] - 10https://gerrit.wikimedia.org/r/365868 (https://phabricator.wikimedia.org/T167559) [22:35:47] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:00] (03CR) 10jerkins-bot: [V: 04-1] labtest: rabbitmq for openstack control node [puppet] - 10https://gerrit.wikimedia.org/r/365868 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [22:38:06] (03PS2) 10Rush: labtest: rabbitmq for openstack control node [puppet] - 10https://gerrit.wikimedia.org/r/365868 (https://phabricator.wikimedia.org/T167559) [22:39:18] (03CR) 10jerkins-bot: [V: 04-1] labtest: rabbitmq for openstack control node [puppet] - 10https://gerrit.wikimedia.org/r/365868 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [22:39:44] RainbowSprinkles: so tomorrow..? 
[22:45:55] jouncebot: next [22:45:55] In 0 hour(s) and 14 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170717T2300) [22:46:12] jdlrobson: Nothing else on swat, let's steal the window [22:46:39] RainbowSprinkles: roger [22:51:18] RECOVERY - Nginx local proxy to apache on mw2201 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.152 second response time [22:51:37] RECOVERY - Apache HTTP on mw2201 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time [22:52:19] (03PS3) 10Rush: labtest: rabbitmq for openstack control node [puppet] - 10https://gerrit.wikimedia.org/r/365868 (https://phabricator.wikimedia.org/T167559) [22:53:03] (03PS4) 10Rush: labtest: rabbitmq for openstack control node [puppet] - 10https://gerrit.wikimedia.org/r/365868 (https://phabricator.wikimedia.org/T167559) [22:54:27] (03CR) 10jerkins-bot: [V: 04-1] labtest: rabbitmq for openstack control node [puppet] - 10https://gerrit.wikimedia.org/r/365868 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [22:57:06] (03PS5) 10Rush: labtest: rabbitmq for openstack control node [puppet] - 10https://gerrit.wikimedia.org/r/365868 (https://phabricator.wikimedia.org/T167559) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170717T2300). Please do the needful. [23:04:13] RainbowSprinkles: ready? [23:04:44] thcipriani: it's crazy, but that server is still not done or i would have pinged .. it's doing things though... 
[23:05:06] it just created /srv/mediawiki for scap [23:08:19] mutante: computers are weird :) [23:08:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw2201 is OK: OK ferm input default policy is set [23:08:48] RECOVERY - Disk space on mw2201 is OK: DISK OK [23:08:53] Yep, sorry. [23:08:55] Lez do it [23:09:17] RECOVERY - HHVM processes on mw2201 is OK: PROCS OK: 6 processes with command name hhvm [23:09:19] Do we have a patch already? [23:09:23] RainbowSprinkles: just tell me when to test [23:09:27] RECOVERY - Check size of conntrack table on mw2201 is OK: OK: nf_conntrack is 0 % full [23:09:27] RECOVERY - configured eth on mw2201 is OK: OK - interfaces up [23:09:28] RECOVERY - MD RAID on mw2201 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [23:09:28] RECOVERY - salt-minion processes on mw2201 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:09:36] RainbowSprinkles: not that i know of but we did it last week [23:09:37] RECOVERY - dhclient process on mw2201 is OK: PROCS OK: 0 processes with command name dhclient [23:09:40] so i guess we can revert the revert? [23:10:03] https://gerrit.wikimedia.org/r/365093 [23:10:05] starting with that? 
[23:10:15] or reverting https://gerrit.wikimedia.org/r/365188 RainbowSprinkles [23:10:39] i can prepare one for everywhere [23:11:16] Sounds good [23:12:00] i assume you will hit the revert button on the revert :) [23:12:08] (03PS1) 10Chad: Revert "Revert "Move testwiki to MinervaNeue"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365874 [23:13:02] (03CR) 10Chad: [C: 032] Revert "Revert "Move testwiki to MinervaNeue"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365874 (owner: 10Chad) [23:13:27] RECOVERY - DPKG on mw2201 is OK: All packages OK [23:13:38] (03PS1) 10Jdlrobson: Minerva is live everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365875 (https://phabricator.wikimedia.org/T166748) [23:15:38] (03PS2) 10Chad: Revert "Revert "Move testwiki to MinervaNeue"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365874 [23:17:27] PROBLEM - DPKG on mw2201 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:18:29] icinga-wm: patience... [23:18:38] (03CR) 10jenkins-bot: Revert "Revert "Move testwiki to MinervaNeue"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365874 (owner: 10Chad) [23:19:10] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: testwiki to minervaneue (duration: 00m 44s) [23:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:35] RainbowSprinkles: looking good on test [23:19:46] It's errrywhere [23:19:50] (03PS5) 10Smalyshev: [WIP] Enable banning clients by IP by setting wdqs::gui::bad_clients [puppet] - 10https://gerrit.wikimedia.org/r/365821 (https://phabricator.wikimedia.org/T170860) [23:20:44] RainbowSprinkles: so test wiki is working fine. I see Minerva installed and don't see any issues in logstash [23:21:03] Ok good. 
Let's do the "all wikis" patch [23:21:07] Then pull to mwdebug1001 [23:21:14] (03PS2) 10Chad: Minerva is live everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365875 (https://phabricator.wikimedia.org/T166748) (owner: 10Jdlrobson) [23:21:29] RainbowSprinkles: sounds good to me! [23:22:28] RECOVERY - DPKG on mw2201 is OK: All packages OK [23:24:07] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 31 seconds ago with 9 failures. Failed resources (up to 3 shown): File[/etc/apache2/mods-available/setenvif.conf],File[/etc/apache2/mods-available/userdir.conf],File[/etc/apache2/mods-available/autoindex.conf],File[/etc/firejail/mediawiki-converters.profile] [23:24:22] (03CR) 10Chad: [C: 032] Minerva is live everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365875 (https://phabricator.wikimedia.org/T166748) (owner: 10Jdlrobson) [23:25:39] (03Merged) 10jenkins-bot: Minerva is live everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365875 (https://phabricator.wikimedia.org/T166748) (owner: 10Jdlrobson) [23:26:08] (03CR) 10jenkins-bot: Minerva is live everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365875 (https://phabricator.wikimedia.org/T166748) (owner: 10Jdlrobson) [23:26:50] RainbowSprinkles: testing [23:27:50] RainbowSprinkles: looks good to me [23:27:58] Ok, one sec [23:28:01] no css mess [23:28:02] no fatals [23:28:07] all working as normal [23:28:37] RECOVERY - nutcracker port on mw2201 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [23:28:37] RECOVERY - nutcracker process on mw2201 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [23:31:46] jdlrobson: Ok, let's do it to everywhere. 
Please keep an eye on all spots, we may need to revert fast [23:32:00] RainbowSprinkles: yup [23:32:03] eyes like an eagle [23:32:25] Ok, button pressed [23:32:28] 10Operations, 10Release-Engineering-Team, 10Composer: Setup a Composer Repository (Packagist) for MediaWiki Extensions - https://phabricator.wikimedia.org/T170897#3446751 (10Reedy) [23:33:06] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: all wikis to minervaneue (duration: 00m 44s) [23:33:07] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [23:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:28] PROBLEM - HHVM rendering on mw2201 is CRITICAL: connect to address 10.192.32.89 and port 80: Connection refused [23:34:32] I don't see anything exploding [23:34:57] PROBLEM - Apache HTTP on mw2201 is CRITICAL: connect to address 10.192.32.89 and port 80: Connection refused [23:35:07] RainbowSprinkles: yeh it's looking good to me [23:35:37] PROBLEM - Nginx local proxy to apache on mw2201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 327 bytes in 0.150 second response time [23:35:37] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2201 is OK: OK: synced at Mon 2017-07-17 23:35:33 UTC. 
[23:36:00] (03CR) 10Smalyshev: [C: 031] Add sandboxing directives to wdqs-blazegraph.service [puppet] - 10https://gerrit.wikimedia.org/r/365518 (owner: 10Lucas Werkmeister (WMDE)) [23:37:41] RainbowSprinkles: still no problems [23:37:50] thcipriani: it's ready to be synced now [23:38:04] mutante: /me does [23:38:17] RainbowSprinkles i think the train is good to go tomorrow :) [23:38:37] Wheeeee [23:38:37] RECOVERY - HHVM rendering on mw2201 is OK: HTTP OK: HTTP/1.1 200 OK - 79546 bytes in 1.097 second response time [23:38:37] RECOVERY - Nginx local proxy to apache on mw2201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.506 second response time [23:38:57] RECOVERY - Apache HTTP on mw2201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.996 second response time [23:39:51] RainbowSprinkles: i'll be around tomorrow in case it's not and i'll enable irc on my phone in case we have any emergencies over the evening [23:40:05] * RainbowSprinkles nods [23:40:59] mutante: this was mw2201.codfw.wmnet, right? [23:41:19] if so: up-to-date now [23:42:06] thcipriani: yes it was, thank you. and i also see all the services running. [23:45:07] cleared the systemd error and repooling it [23:45:47] RECOVERY - Check systemd state on mw2201 is OK: OK - running: The system is fully operational [23:46:37] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2201.codfw.wmnet [23:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:28] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3446803 (10Dzahn) mw2201 is fixed now. 
[23:47:42] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3446804 (10Dzahn) [23:48:28] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3446807 (10Dzahn) a:05Dzahn>03Papaul mw2201 has been reinstalled and repooled and is working now. the issue is resolved. thanks papaul. giving the ticket back...