[00:34:05] (03Restored) 10Dzahn: Set differential.allow-self-accept to true in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/281071 (https://phabricator.wikimedia.org/T131622) (owner: 10Paladox) [00:36:27] (03PS7) 10Dzahn: Set differential.allow-self-accept to true in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/281071 (https://phabricator.wikimedia.org/T131622) (owner: 10Paladox) [00:38:50] (03CR) 10Dzahn: [C: 032] "per " to make the default config match live."" [puppet] - 10https://gerrit.wikimedia.org/r/281071 (https://phabricator.wikimedia.org/T131622) (owner: 10Paladox) [00:39:05] (03PS8) 10Dzahn: Set differential.allow-self-accept to true in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/281071 (https://phabricator.wikimedia.org/T131622) (owner: 10Paladox) [00:39:15] (03Abandoned) 10Dzahn: Set differential.allow-self-accept to true in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/310394 (https://phabricator.wikimedia.org/T131622) (owner: 10Paladox) [00:43:51] (03PS1) 10Dzahn: partman: delete analytics-cisco recipe [puppet] - 10https://gerrit.wikimedia.org/r/310469 [00:44:16] (03CR) 10Dzahn: [C: 032] partman: delete analytics-cisco recipe [puppet] - 10https://gerrit.wikimedia.org/r/310469 (owner: 10Dzahn) [00:45:36] (03PS2) 10Dzahn: partman: delete analytics-cisco recipe [puppet] - 10https://gerrit.wikimedia.org/r/310469 [00:47:41] (03CR) 10Dzahn: "thanks ottomata, i did that in a separate change, wasn't sure about the other files yet" [puppet] - 10https://gerrit.wikimedia.org/r/306501 (owner: 10Dzahn) [00:47:47] (03PS3) 10Dzahn: partman: delete some more unused recipes [puppet] - 10https://gerrit.wikimedia.org/r/306501 [00:48:56] (03PS2) 10Dzahn: Revert "archiva: migration class to rsync data to new host" [puppet] - 10https://gerrit.wikimedia.org/r/307900 (https://phabricator.wikimedia.org/T123725) [00:49:17] (03CR) 10jenkins-bot: [V: 04-1] Revert "archiva: migration class to rsync data to new host" [puppet] - 
10https://gerrit.wikimedia.org/r/307900 (https://phabricator.wikimedia.org/T123725) (owner: 10Dzahn) [00:50:38] (03Abandoned) 10Dzahn: admin: add dpatrick to sectools-roots, put group in role [puppet] - 10https://gerrit.wikimedia.org/r/296651 (https://phabricator.wikimedia.org/T138873) (owner: 10Dzahn) [00:51:37] (03Abandoned) 10Dzahn: add IPv6 AAAA and reverse for palladium [dns] - 10https://gerrit.wikimedia.org/r/302624 (owner: 10Dzahn) [00:52:56] (03CR) 10Dzahn: "looks like Legoktm was right about this taking too long..." [puppet] - 10https://gerrit.wikimedia.org/r/307667 (https://phabricator.wikimedia.org/T143465) (owner: 10Dzahn) [00:53:49] :/ [01:04:59] (03PS4) 10Dzahn: chromium: Ubuntu and Debian compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300491 [01:07:37] (03PS5) 10Dzahn: chromium: Ubuntu and Debian compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300491 (https://phabricator.wikimedia.org/T141023) [01:13:58] 06Operations, 10Mail, 10OTRS, 10Wiki-Loves-Monuments: E-mails not being received by OTRS - https://phabricator.wikimedia.org/T145293#2625771 (10Platonides) The address info-nl@wikilovesmonuments.org does work, sending the email into OTRS, see 2016091410000178 SOA suggests DNS was last changed on ago 19th.... 
[01:15:33] (03CR) 10Dzahn: [C: 032] "no-op on osmium http://puppet-compiler.wmflabs.org/4062/" [puppet] - 10https://gerrit.wikimedia.org/r/300491 (https://phabricator.wikimedia.org/T141023) (owner: 10Dzahn) [01:15:41] (03PS6) 10Dzahn: chromium: Ubuntu and Debian compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300491 (https://phabricator.wikimedia.org/T141023) [01:18:52] 06Operations, 06Discovery, 06Labs, 06Maps, and 3 others: PostgreSQL query planner bug on labsdb1006 - https://phabricator.wikimedia.org/T145599#2635546 (10MaxSem) [01:22:51] 06Operations, 06Discovery, 06Labs, 06Maps, and 3 others: PostgreSQL query planner bug on labsdb1006 - https://phabricator.wikimedia.org/T145599#2635546 (10Yurik) I think we should simply update the postgres/postgis on labs instance [01:25:06] PROBLEM - MariaDB Slave Lag: m3 on db1043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1501.93 seconds [01:43:01] RECOVERY - MariaDB Slave Lag: m3 on db1043 is OK: OK slave_sql_lag Replication lag: 0.32 seconds [01:54:02] (03PS1) 10Aaron Schulz: Set $wgAPIMaxLagThreshold => 3 and "max lag" => 6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310479 [01:54:46] (03CR) 10Aaron Schulz: [C: 032] Set $wgAPIMaxLagThreshold => 3 and "max lag" => 6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310479 (owner: 10Aaron Schulz) [01:55:20] (03Merged) 10jenkins-bot: Set $wgAPIMaxLagThreshold => 3 and "max lag" => 6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310479 (owner: 10Aaron Schulz) [01:56:40] !log aaron@tin Synchronized wmf-config: Set $wgAPIMaxLagThreshold => 3 and "max lag" => 6 (duration: 00m 51s) [01:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:16:21] (03CR) 10Alex Monk: "what's your TENANT_NAME?" 
[puppet] - 10https://gerrit.wikimedia.org/r/309709 (https://phabricator.wikimedia.org/T123607) (owner: 10Alex Monk) [02:23:18] (03CR) 10Alex Monk: "This works from silver.wikimedia.org:~krenair" [puppet] - 10https://gerrit.wikimedia.org/r/309709 (https://phabricator.wikimedia.org/T123607) (owner: 10Alex Monk) [02:33:44] (03CR) 10Dzahn: "yes, puppet-lint says" [puppet] - 10https://gerrit.wikimedia.org/r/308328 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [02:35:30] (03PS10) 10Dzahn: Fix broken refs/meta/config diffusion links in access section [puppet] - 10https://gerrit.wikimedia.org/r/308885 (https://phabricator.wikimedia.org/T137354) (owner: 10Paladox) [02:35:51] (03CR) 10Dzahn: [C: 032] "yea, confirmed. following the provided link currently gets us "Unhandled Exception ("DiffusionRefNotFoundException")" [puppet] - 10https://gerrit.wikimedia.org/r/308885 (https://phabricator.wikimedia.org/T137354) (owner: 10Paladox) [02:37:41] (03PS13) 10Dzahn: Gerit: Rewrite outdated comment about Gerrit-Phabricator linking [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE)) [02:39:35] (03PS14) 10Dzahn: Gerit: Rewrite outdated comment about Gerrit-Phabricator linking [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE)) [02:39:41] (03CR) 10Dzahn: [C: 032] Gerit: Rewrite outdated comment about Gerrit-Phabricator linking [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE)) [02:40:25] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 17m 56s) [02:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:42:57] !log gerrit restarting to apply config changes 256663 and 308885 [02:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:44:07] !log gerrit back 
to normal [02:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:46:47] !log restarted grrrrit-wm [02:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:49:02] (03CR) 10Dzahn: "added dpatrick and faidon because of the comments on ticket" [puppet] - 10https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) (owner: 10Paladox) [02:50:16] (03CR) 10Dzahn: [C: 031] ldap: migrate role classes to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/308314 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [02:52:36] (03CR) 10Dzahn: "@hashar it doesnt look to me like it affects production, see the "requires realm labs" and also "WARNING: TOTALLY DEPRECATED, DO NOT USE" " [puppet] - 10https://gerrit.wikimedia.org/r/308322 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [02:56:10] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/4063/ (will merge tomorrow or so)" [puppet] - 10https://gerrit.wikimedia.org/r/308322 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [02:58:46] (03CR) 10Dzahn: "and now the link works :)" [puppet] - 10https://gerrit.wikimedia.org/r/308885 (https://phabricator.wikimedia.org/T137354) (owner: 10Paladox) [02:59:07] (03CR) 10BryanDavis: [C: 031] Set $wgDefaultExternalStore for wikitech before Flow settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309225 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [02:59:33] (03CR) 10BryanDavis: [C: 031] Add logging channel for NewUserMessage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309346 (https://phabricator.wikimedia.org/T131957) (owner: 10Mattflaschen) [03:15:07] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.19) (duration: 18m 08s) [03:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:17:07] (03PS1) 10Jhobs: Initiate Hovercards A/B test on ruwiki and itwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/310483 (https://phabricator.wikimedia.org/T136746) [03:22:14] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Sep 14 03:22:13 UTC 2016 (duration 7m 6s) [03:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:12:44] (03PS2) 10Mattflaschen: Add logging channel for NewUserMessage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309346 (https://phabricator.wikimedia.org/T131957) [06:29:22] (03PS1) 10Giuseppe Lavagetto: role::puppetmaster::frontend: remove 'puppet' vhost [puppet] - 10https://gerrit.wikimedia.org/r/310490 [06:37:03] (03CR) 10Giuseppe Lavagetto: [C: 032] role::puppetmaster::frontend: remove 'puppet' vhost [puppet] - 10https://gerrit.wikimedia.org/r/310490 (owner: 10Giuseppe Lavagetto) [06:37:12] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/4064/ noop" [puppet] - 10https://gerrit.wikimedia.org/r/310490 (owner: 10Giuseppe Lavagetto) [06:39:08] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[debian-goodies] [06:49:08] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:50:25] (03PS1) 10Giuseppe Lavagetto: puppet: switch ulsfo to use srv records, puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/310492 [06:53:46] (03CR) 10Muehlenhoff: "freeipmi also exists in trusty. 
It even exists in precise, but the binary package names use a different scheme, but let's not bother about" [puppet] - 10https://gerrit.wikimedia.org/r/310369 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [06:54:36] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: switch ulsfo to use srv records, puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/310492 (owner: 10Giuseppe Lavagetto) [07:03:31] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [07:03:37] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [07:06:19] (03PS3) 10KartikMistry: Remove cxserver restbase_url [puppet] - 10https://gerrit.wikimedia.org/r/306674 (https://phabricator.wikimedia.org/T129284) [07:10:21] (03CR) 10KartikMistry: "Scheduled on Thursday Puppet SWAT, https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0September.C2.A015" [puppet] - 10https://gerrit.wikimedia.org/r/306674 (https://phabricator.wikimedia.org/T129284) (owner: 10KartikMistry) [07:11:22] 06Operations, 10Analytics: Remove cronspam from stat1002 to root@ - https://phabricator.wikimedia.org/T145606#2635815 (10elukey) [07:14:57] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2635833 (10MoritzMuehlenhoff) The stack trace sounds like a broken disk controller (or possibly broken RAM). I'd say let Chris to a hardware check. 
[07:20:25] 06Operations, 10DBA: Hardware check - https://phabricator.wikimedia.org/T145607#2635842 (10Marostegui) [07:32:42] !log legoktm@tin Synchronized php-1.28.0-wmf.19/includes/db/loadbalancer/LBFactory.php: Use cpPosTime cookie for same-domain redirects on DB change - https://gerrit.wikimedia.org/r/#/c/310494/ (duration: 00m 46s) [07:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:33:59] !log legoktm@tin Synchronized php-1.28.0-wmf.19/includes/db/ChronologyProtector.php: Use cpPosTime cookie for same-domain redirects on DB change - https://gerrit.wikimedia.org/r/#/c/310494/ (duration: 00m 47s) [07:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:35:00] 06Operations, 10ops-eqiad, 10DBA: Hardware check - https://phabricator.wikimedia.org/T145607#2635861 (10Marostegui) a:05Cmjohnson>03None [07:35:24] !log legoktm@tin Synchronized php-1.28.0-wmf.19/includes/MediaWiki.php: Use cpPosTime cookie for same-domain redirects on DB change - https://gerrit.wikimedia.org/r/#/c/310494/ (duration: 00m 45s) [07:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:35:53] 06Operations, 10ops-eqiad, 10DBA: db1082 hardware check - https://phabricator.wikimedia.org/T145607#2635865 (10jcrespo) [07:38:39] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2635866 (10jcrespo) Thank you @MoritzMuehlenhoff for your incredibly quick evaluation, I didn't even check the full stacktrace, you were really helpful. I will unsubscribe you so you do not suffer spam from the rest of t... [07:40:11] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2635885 (10jcrespo) @Marostegui While we wait, probably we can check the lifecycle hardware logs. [07:50:22] 06Operations, 07Puppet, 13Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. 
- https://phabricator.wikimedia.org/T119042#2635891 (10Legoktm) [07:50:26] 06Operations, 07Puppet, 13Patch-For-Review: Add a Jenkins check that forbids creation of /modules/role/manifests/*.pp - https://phabricator.wikimedia.org/T144774#2635890 (10Legoktm) 05Open>03Resolved [07:55:48] 06Operations, 06Performance-Team, 10Thumbor: Gifsicle engine: AttributeError: 'Engine' object has no attribute 'exif' - https://phabricator.wikimedia.org/T145504#2635892 (10Gilles) [08:05:32] 06Operations, 06Performance-Team, 10Thumbor: Gifsicle engine: AttributeError: 'Engine' object has no attribute 'exif' - https://phabricator.wikimedia.org/T145504#2635902 (10Gilles) p:05Low>03Lowest This is an upstream bug for the Gifsicle engine used at the same time as the RESPECT_ORIENTATION option. F... [08:12:37] 06Operations, 06Performance-Team, 10Thumbor: rsvg exit status -11 - https://phabricator.wikimedia.org/T145610#2635918 (10Gilles) [08:21:58] 06Operations, 06Performance-Team, 10Thumbor: rsvg exit status -11 - https://phabricator.wikimedia.org/T145610#2635953 (10Gilles) It is a segfault coming from rsvg-convert, which affects both production and thumbor since they're running the same rsvg-convert binary: ``` gilles@thumbor1001:~$ /usr/bin/rsvg-co... [08:24:39] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2635956 (10Gilles) [08:24:42] 06Operations, 06Performance-Team, 10Thumbor: rsvg exit status -11 - https://phabricator.wikimedia.org/T145610#2635954 (10Gilles) 05Open>03declined Actually, this seems appropriate. It is an internal server error due to an rsvg bug, and we need to keep track of it. 
[08:29:38] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2635959 (10Gilles) [08:30:19] 06Operations, 10Analytics: Remove cronspam from stat1002 to root@ - https://phabricator.wikimedia.org/T145606#2635986 (10elukey) [08:32:30] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:32:31] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [08:32:43] 06Operations, 06Performance-Team, 10Thumbor: 0px thumbnail requests should fail more elegantly - https://phabricator.wikimedia.org/T145614#2635993 (10Gilles) [08:38:14] 06Operations, 06Performance-Team, 10Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2636021 (10Gilles) [08:40:27] (03PS1) 10Giuseppe Lavagetto: base::puppet: stop using SRV record [puppet] - 10https://gerrit.wikimedia.org/r/310496 [08:40:29] (03PS1) 10Giuseppe Lavagetto: base::puppet: add ca_server setting when needed [puppet] - 10https://gerrit.wikimedia.org/r/310497 [08:41:04] <_joe_> akosiaris: look at the second one please [08:42:11] 06Operations, 06Performance-Team, 10Thumbor: 0px thumbnail requests should fail more elegantly - https://phabricator.wikimedia.org/T145614#2636046 (10Gilles) [08:42:14] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2636047 (10Gilles) [08:42:17] 06Operations, 06Performance-Team, 10Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2636045 (10Gilles) [08:42:25] (03CR) 10Alexandros Kosiaris: [C: 031] base::puppet: stop using SRV record [puppet] - 10https://gerrit.wikimedia.org/r/310496 (owner: 10Giuseppe Lavagetto) [08:42:52] (03CR) 10Giuseppe Lavagetto: [C: 032] base::puppet: stop using SRV record 
[puppet] - 10https://gerrit.wikimedia.org/r/310496 (owner: 10Giuseppe Lavagetto) [08:42:57] 06Operations, 06Performance-Team, 10Thumbor: pdf failure - https://phabricator.wikimedia.org/T145617#2636049 (10Gilles) [08:43:04] any change on commons/uploads recently? [08:43:15] (03CR) 10Alexandros Kosiaris: [C: 031] base::puppet: add ca_server setting when needed [puppet] - 10https://gerrit.wikimedia.org/r/310497 (owner: 10Giuseppe Lavagetto) [08:43:27] !log alter localuser table in db2047 - T141951 [08:43:28] T141951: Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951 [08:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:43:54] we are seeing multi-second queries on s4, so many that they are saturating the port [08:45:00] mostly coming from anonymous users [08:45:01] jynus: we've enabled thumbor yesterday for commons too, though it shouldn't ask mw anything [08:45:27] godog, I would bet either a deploy or an increasingly popular image [08:45:34] ah ok, also WLM is on this month [08:45:44] <_joe_> WLM? [08:45:50] wiki loves monuments [08:45:59] it is gone now [08:46:19] but something popular would stop at varnish [08:46:27] no need to query the db [08:46:40] (especially because they were anonymous users) [08:48:33] 06Operations, 06Performance-Team, 10Thumbor: Thumbor SVG regexp insufficient - https://phabricator.wikimedia.org/T145618#2636066 (10Gilles) [08:51:48] 06Operations, 06Performance-Team, 10Thumbor: Thumbor SVG regexp insufficient - https://phabricator.wikimedia.org/T145618#2636086 (10Gilles) This is probably a fairly common svg syntax, as it's produced by Adobe Illustrator. The issue seems to be that the xmlns is defined as a reference. 
[08:56:54] 06Operations, 06Performance-Team, 10Thumbor: rsvg exit status -11 - https://phabricator.wikimedia.org/T145610#2636100 (10fgiunchedi) FTR the rsvg-convert crashes are tracked at {T137876} [08:57:49] (03CR) 10Alexandros Kosiaris: [C: 031] conftool: get conf from class parameters [puppet] - 10https://gerrit.wikimedia.org/r/310459 (owner: 10Alex Monk) [08:58:27] This is what I am talking about (CC marostegui): https://grafana.wikimedia.org/dashboard/db/mysql?panelId=5&fullscreen&from=1473832675420&to=1473843475420&var-dc=eqiad%20prometheus%2Fops&var-server=db1081 [08:59:15] normal traffic is 6-10 Mb/s [09:01:22] jynus: the traffic IN does not change so looks like it's not a huge amount of queries but queries that are returning a *lot* of rows [09:01:31] (03CR) 10Alexandros Kosiaris: [C: 031] monitoring: add check_ipmi_sensor plugin [puppet] - 10https://gerrit.wikimedia.org/r/310379 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [09:01:40] yep [09:02:33] (03CR) 10Alexandros Kosiaris: [C: 031] base: install 'freeipmi', 'libipc-run-perl' on jessie [puppet] - 10https://gerrit.wikimedia.org/r/310369 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [09:02:41] gotcha [09:03:24] (03CR) 10Alexandros Kosiaris: [C: 031] "I suppose we expect the BMC to tell us if say 80 Celcius is CRITICAL or not ?" [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [09:04:13] dump? 
[09:04:34] :-P [09:07:17] (03CR) 10Alexandros Kosiaris: [C: 032] puppet-merge: Allow forcing, passing SHA1 args [puppet] - 10https://gerrit.wikimedia.org/r/310300 (owner: 10Alexandros Kosiaris) [09:07:22] (03PS3) 10Alexandros Kosiaris: puppet-merge: Allow forcing, passing SHA1 args [puppet] - 10https://gerrit.wikimedia.org/r/310300 [09:07:25] (03CR) 10Alexandros Kosiaris: [V: 032] puppet-merge: Allow forcing, passing SHA1 args [puppet] - 10https://gerrit.wikimedia.org/r/310300 (owner: 10Alexandros Kosiaris) [09:09:36] (03CR) 10Alexandros Kosiaris: puppetmaster: post-merge command is actually irrelevant (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310301 (owner: 10Alexandros Kosiaris) [09:15:26] (03PS1) 10Muehlenhoff: Make /opt/wmf-mariadb10/bin/mysqld_safe managed by puppet [puppet] - 10https://gerrit.wikimedia.org/r/310498 (https://phabricator.wikimedia.org/T145378) [09:16:01] 06Operations, 06Performance-Team, 10Thumbor: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636188 (10fgiunchedi) [09:16:37] (03CR) 10jenkins-bot: [V: 04-1] Make /opt/wmf-mariadb10/bin/mysqld_safe managed by puppet [puppet] - 10https://gerrit.wikimedia.org/r/310498 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [09:17:23] (03CR) 10Alexandros Kosiaris: puppetmaster: Make puppet-merge a template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310302 (owner: 10Alexandros Kosiaris) [09:18:54] (03PS1) 10Filippo Giunchedi: thumbor: set MemoryLimit to 1G [puppet] - 10https://gerrit.wikimedia.org/r/310500 (https://phabricator.wikimedia.org/T145623) [09:18:57] Hello. I am requesting permission from operations to perform 5 bigdeletes on ruwiki. Cf. https://meta.wikimedia.org/wiki/Steward_requests/Miscellaneous#Deleting_a_pages_with_a_.3E5000_revisions_in_ruwiki [09:19:53] (03CR) 10Jcrespo: "What could possibly make I own the files, when it was packaged on a different machine with probably different UID?" 
[puppet] - 10https://gerrit.wikimedia.org/r/310498 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [09:22:14] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: set MemoryLimit to 1G [puppet] - 10https://gerrit.wikimedia.org/r/310500 (https://phabricator.wikimedia.org/T145623) (owner: 10Filippo Giunchedi) [09:22:19] (03PS2) 10Filippo Giunchedi: thumbor: set MemoryLimit to 1G [puppet] - 10https://gerrit.wikimedia.org/r/310500 (https://phabricator.wikimedia.org/T145623) [09:22:31] (03CR) 10Filippo Giunchedi: [V: 032] thumbor: set MemoryLimit to 1G [puppet] - 10https://gerrit.wikimedia.org/r/310500 (https://phabricator.wikimedia.org/T145623) (owner: 10Filippo Giunchedi) [09:23:06] (03PS5) 10Alexandros Kosiaris: puppetmaster: Change the backend forced ssh command [puppet] - 10https://gerrit.wikimedia.org/r/310304 [09:23:08] (03PS4) 10Alexandros Kosiaris: puppetmaster: Make puppet-merge a template [puppet] - 10https://gerrit.wikimedia.org/r/310302 [09:23:10] (03PS4) 10Alexandros Kosiaris: puppetmaster: Delete the post-merge hooks [puppet] - 10https://gerrit.wikimedia.org/r/310303 [09:23:51] jynus: who I should be talking with to get an ok before performing 5 bigdeletes on ruwiki (+45k revs 4 of them; +12k the other one)? We were told to ask your people before. [09:24:01] database disruption, etc. [09:24:39] "my people" will prbably say it's me! [09:24:44] :-) [09:26:00] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2636229 (10hashar) [09:26:33] mafk, can you give me details or point me to a ticket? [09:26:54] jynus: there's no ticket, but you can see my first message here [09:26:58] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2586022 (10hashar) [09:27:05] 11:19] mafk Hello. 
I am requesting permission from operations to perform 5 bigdeletes on ruwiki. Cf. https://meta.wikimedia.org/wiki/Steward_requests/Miscellaneous#Deleting_a_pages_with_a_.3E5000_revisions_in_ruwiki <-- @ jynus [09:27:18] mafk, sorry, I didn't see that [09:27:29] np [09:27:42] (03PS1) 10Giuseppe Lavagetto: puppetmaster::web_frontend: use secret() for non-fqdn sites [puppet] - 10https://gerrit.wikimedia.org/r/310501 [09:27:54] I'll delete using API this time, to avoid the page show that request timeout message [09:27:57] <_joe_> akosiaris: can you review ^^ [09:28:33] mafk, in general, you shouldn't need to ask permission, as long as you do them "properly" [09:28:47] there is a guideline for mass api changes [09:28:54] let me show you [09:28:58] jynus: vvv told us some years ago to stop by here in this cases [09:29:10] yes, and it is appreciated [09:29:16] :) [09:29:26] just let me show you what you should do [09:29:28] we don't want to disrupt operations [09:29:38] jynus: I'm on API sandbox [09:29:40] and with that, no issues should happen [09:29:42] it's easy [09:30:01] (03CR) 10Muehlenhoff: "The files are owned by uid 12161, which is your cluster-wide UID, so they were likely built on some wikimedia host. In most Debian packagi" [puppet] - 10https://gerrit.wikimedia.org/r/310498 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [09:30:15] (03CR) 10Alexandros Kosiaris: [C: 031] puppetmaster::web_frontend: use secret() for non-fqdn sites [puppet] - 10https://gerrit.wikimedia.org/r/310501 (owner: 10Giuseppe Lavagetto) [09:30:20] mark, let me find the page, it has been moved several times [09:31:03] https://ru.wikipedia.org/wiki/%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:NUMBEROF/data ? 
[09:31:16] ah, the help page you mean [09:31:18] no, the "API advice" [09:31:31] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636240 (10Gilles) Is there any way to know which requests were killed by that? [09:31:36] (03CR) 10Giuseppe Lavagetto: puppetmaster: Make puppet-merge a template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310302 (owner: 10Alexandros Kosiaris) [09:31:45] mafk, https://www.mediawiki.org/wiki/API:Etiquette [09:31:53] in particular, the idea is [09:31:56] jynus: if I do it via API Sandbox, one by one, would we have any issues? [09:32:04] 1) do things serially [09:32:09] so that would respond to that [09:32:15] and use the maxlag parameter [09:32:23] which will make you pause if there are issues [09:32:34] 06Operations, 06Performance-Team, 10Thumbor: 'NoneType' object has no attribute 'lstrip' - https://phabricator.wikimedia.org/T145505#2636241 (10Gilles) [09:32:37] 06Operations, 06Performance-Team, 10Thumbor: Thumbor SVG regexp insufficient - https://phabricator.wikimedia.org/T145618#2636242 (10Gilles) [09:32:56] if you are going to do things "manually", it is difficult to create issues [09:33:12] unless you are changing highly sensitive templates [09:33:24] no robot will be used for this [09:33:37] okay, I'll start, please check ruwiki dbs for issues [09:33:43] (03CR) 10Giuseppe Lavagetto: [C: 031] puppetmaster: post-merge command is actually irrelevant (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310301 (owner: 10Alexandros Kosiaris) [09:33:43] sure, will do [09:33:53] and thanks for the heads up, mafk [09:34:00] now if there are issues, I can tell you here [09:34:08] jynus: how much maxlag do you recommend to me? 
[09:34:24] although I'd not want the job to stop unfinished [09:34:41] the idea is to, if lag increases, pause [09:34:52] (03CR) 10Giuseppe Lavagetto: [C: 031] puppetmaster: Make puppet-merge a template [puppet] - 10https://gerrit.wikimedia.org/r/310302 (owner: 10Alexandros Kosiaris) [09:34:54] then restart when it goes back to 0 [09:35:23] e.g. it says: [09:35:36] (03CR) 10Marostegui: [C: 031] "Makes sense if we need to touch this file directly often." [puppet] - 10https://gerrit.wikimedia.org/r/310498 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [09:35:36] "Use maxlag=5 (5 seconds). This is an appropriate non-aggressive value, set as default value on Pywikibot. Higher values mean more aggressive behaviour, lower values are nicer." [09:35:52] "If you get a lag error, pause your script for at least 5 seconds before trying again." [09:36:01] that is useful advice [09:36:13] I'll use 6 [09:36:22] actually, that is worse [09:36:39] lower == better, but it could take more time to do the changes [09:37:11] 5 then? [09:37:15] or lower? 
[09:37:35] I would use 1-2 if you are not in a hurry [09:38:00] nowadays, 5 seconds of lag is rare [09:38:16] 06Operations, 06Performance-Team, 10Thumbor: thumbor ffmpeg pipe deadlock - https://phabricator.wikimedia.org/T145626#2636257 (10fgiunchedi) [09:38:30] in progress one [09:38:51] <_joe_> akosiaris: I have some nits for 310302, but they can be addressed later if you prefer [09:39:02] jynus: "An error occurred while loading the response to the API query: HTTP error: timeout" [09:39:48] interesting, let me see [09:41:08] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404#2636275 (10Gehel) p:05Triage>03High a:05dcausse>03Gehel [09:41:13] there are some connection issues on db1087 [09:41:24] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404#2628906 (10Gehel) [09:41:41] smaxage / maxage should be set? [09:42:34] mafk, if I am honest with you, api parameters are not my speciality, outside of the ones I mentioned to you [09:42:44] I'll retry [09:42:54] 06Operations, 06Performance-Team, 10Thumbor: thumbor ffmpeg pipe deadlock - https://phabricator.wikimedia.org/T145626#2636279 (10Gilles) I'll try to switch to using a temp file. Currently ffmpeg writes to an stdout stream, which is read by thumbor when the process is complete. I imagine the underlying pipe f... 
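Editor's sketch of the API:Etiquette advice quoted in this conversation (do requests serially, pass maxlag, and pause at least 5 seconds after a lag error before retrying). This is only an illustration under assumptions: `call_with_maxlag` and its `send` callable are hypothetical names that appear nowhere in the log, and `send` stands in for whatever actually posts the parameters to api.php and returns the decoded JSON. The only MediaWiki behaviour relied on is that a lagged request comes back with the error code `maxlag`.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical helper, not part of MediaWiki or Pywikibot.
def call_with_maxlag(
    send: Callable[[Dict[str, Any]], Dict[str, Any]],
    params: Dict[str, Any],
    maxlag: int = 5,
    pause: float = 5.0,
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Send one API request with maxlag set, pausing and retrying on lag errors."""
    # Copy the caller's parameters and add maxlag; lower values are "nicer".
    request = dict(params, maxlag=maxlag, format="json")
    for attempt in range(max_retries + 1):
        response = send(request)
        error = response.get("error", {})
        if error.get("code") != "maxlag":
            # Either success or an unrelated error: hand it back to the caller.
            return response
        if attempt < max_retries:
            # Replication lag is above the threshold: wait before trying again.
            sleep(pause)
    raise RuntimeError("gave up: replication lag stayed above maxlag")
```

Calling this once per page, one call after another, keeps the work serial; with maxlag=1 or 2, as suggested above, the job simply takes longer when the replicas are lagging.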
[09:43:07] (03CR) 10Giuseppe Lavagetto: "minor nits" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310302 (owner: 10Alexandros Kosiaris) [09:43:14] and I am not sure many people use them with the sandbox [09:43:15] 06Operations, 06Performance-Team, 10Thumbor: thumbor ffmpeg pipe deadlock - https://phabricator.wikimedia.org/T145626#2636280 (10Gilles) a:03Gilles [09:43:22] aside from testing [09:43:50] I would ask on a different channel for that, while I can continue monitoring s6 (ruwiki) [09:44:22] you can even do it yourself [09:44:22] jynus: If you need another pair of eyes, let me know [09:44:29] (03PS2) 10Giuseppe Lavagetto: puppetmaster::web_frontend: use secret() for non-fqdn sites [puppet] - 10https://gerrit.wikimedia.org/r/310501 [09:45:03] mafk, check https://dbtree.wikimedia.org/ for lag issues [09:45:30] (of which there are none right now) [09:46:56] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster::web_frontend: use secret() for non-fqdn sites [puppet] - 10https://gerrit.wikimedia.org/r/310501 (owner: 10Giuseppe Lavagetto) [09:47:05] I'll do it the normal way [09:47:09] delete tab [09:47:32] !log Renaming tables before dropping them T54924 [09:47:33] T54924: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924 [09:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:48:04] mafk, I would say go on, do not ask for permission first [09:48:04] aborted, exceeded 3 secs. [09:48:55] geez, this is a headache [09:49:10] I think for that, we would need help from the API experts, as that sounds like a bug [09:49:22] apergos maybe? [09:49:28] if it is not a huge issue [09:49:29] or he's on dumps? 
[09:49:36] to create a ticket [09:49:44] with what you are trying to run [09:49:55] and the error you are getting [09:50:06] and I will make sure to ping the right people [09:50:15] sorry for the inconveniences [09:50:34] no prob [09:51:44] maybe the process has changed, or previous methods no longer work [09:52:00] 06Operations, 10DBA: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924#2636302 (10Marostegui) s1, enwiki, db1073 ``` MariaDB db1073 enwiki > rename table povwatch_log to TO_DROP_povwatch_log; Query OK, 0 rows affected (0.10 sec) MariaDB db1073 e... [09:53:02] maybe I need apihighlimits? [09:53:43] nah, I already have it [09:53:47] on global group [09:54:36] I cannot tell, please provide the query, the error, and when it happened, and I will make sure to redirect a bunch of people there [09:55:01] jynus: I guess I can't delete it via SQL right? [09:55:11] maybe you were told to ask for help because someone used some kind of super-secret method? [09:55:28] so after all, I may not be the right guy to help [09:56:00] mafk, we do not let anyone do anything with SQL, unless there is an emergency ongoing [09:56:10] only queries [09:56:30] it has to be mediawiki scripts [09:56:54] 06Operations, 10Traffic, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2636310 (10MoritzMuehlenhoff) Ack, we're currently using the internally-managed ldap-labs.codfw|eqiad.wikimedia.org cert which are valid until Nov 28 2017.... 
[09:56:57] <_joe_> saving reports from puppet is currently broken, fixing [09:56:59] (03CR) 10Alexandros Kosiaris: puppetmaster: Make puppet-merge a template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310302 (owner: 10Alexandros Kosiaris) [09:57:02] please add me to the ticket when you create it, so I can redirect it [09:57:11] (03PS6) 10Alexandros Kosiaris: puppetmaster: Change the backend forced ssh command [puppet] - 10https://gerrit.wikimedia.org/r/310304 [09:57:13] (03PS5) 10Alexandros Kosiaris: puppetmaster: Make puppet-merge a template [puppet] - 10https://gerrit.wikimedia.org/r/310302 [09:57:15] (03PS5) 10Alexandros Kosiaris: puppetmaster: Delete the post-merge hooks [puppet] - 10https://gerrit.wikimedia.org/r/310303 [09:57:30] Okay. [09:59:28] <_joe_> akosiaris: uhm I might need help here. [10:00:19] _joe_: ? [10:00:31] the repos ? [10:00:34] reports* ? [10:00:36] (03PS1) 10Alexandros Kosiaris: puppetmaster: Have private repo git config user.name [puppet] - 10https://gerrit.wikimedia.org/r/310503 [10:00:39] <_joe_> yes, just fixed [10:00:41] <_joe_> meh [10:01:16] 06Operations, 06Performance-Team, 10Thumbor: 0px thumbnail requests should fail more elegantly - https://phabricator.wikimedia.org/T145614#2636314 (10Gilles) [10:01:19] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636315 (10fgiunchedi) We could probably look at the access log from thumbor and correlate the time. e.g. the last time it happened at `09:55:45` for `thumbor@8823` ```lines=5 Se... [10:01:24] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown) [10:01:35] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown) [10:01:35] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown) [10:01:44] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown) [10:01:44] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown) [10:01:50] _joe_: ^ ? [10:02:07] with no failed resources ? what the... [10:02:26] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown) [10:02:30] <_joe_> akosiaris: that's the reports [10:02:32] <_joe_> yes [10:02:34] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown) [10:02:35] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown) [10:02:39] <_joe_> can you mute icinga-wm? [10:02:42] <_joe_> if's already fixed [10:02:43] yeah [10:02:44] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown) [10:02:46] <_joe_> *it's [10:03:36] !log alter localuser table in db2054 - T141951 [10:03:37] T141951: Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951 [10:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:03:50] interesting, so the summary says 1 failure, the detailed report has no failures... 
[10:04:01] (03PS1) 10Giuseppe Lavagetto: puppetmaster::web_frontend: fix ownership of files [puppet] - 10https://gerrit.wikimedia.org/r/310504 [10:04:41] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster::web_frontend: fix ownership of files [puppet] - 10https://gerrit.wikimedia.org/r/310504 (owner: 10Giuseppe Lavagetto) [10:05:41] (03PS2) 10Alexandros Kosiaris: puppetmaster: Have private repo git config user.name [puppet] - 10https://gerrit.wikimedia.org/r/310503 [10:05:44] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Have private repo git config user.name [puppet] - 10https://gerrit.wikimedia.org/r/310503 (owner: 10Alexandros Kosiaris) [10:08:19] jynus: I managed to delete one via API, but the page with the less nr. of revs. [10:09:20] https://meta.wikimedia.org/wiki/User:MarcoAurelio/Pruebas [10:09:23] I remember a bug with similar issues [10:09:32] let me try to search it [10:11:57] this is why you probably were told to notify here: https://phabricator.wikimedia.org/T13402 [10:12:30] 06Operations, 10Wikimedia-Site-requests: Rename cbk-zamwiki to cbkwiki - https://phabricator.wikimedia.org/T124657#2636342 (10hashar) p:05Low>03Lowest [10:12:32] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Bhojpuri wikipedia should start with 'bho' instead of 'bh' to avoid confusion with Bihari - https://phabricator.wikimedia.org/T41968#2636343 (10hashar) p:05Normal>03Lowest [10:12:34] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Rename emlwiki -> eglwiki - https://phabricator.wikimedia.org/T36217#2636344 (10hashar) p:05Normal>03Lowest [10:12:36] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Rename Võro Wikipedia, fiu-vro -> vro - https://phabricator.wikimedia.org/T31186#2636345 (10hashar) p:05Normal>03Lowest [10:12:38] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Rename zh-classical -> lzh - https://phabricator.wikimedia.org/T30443#2636346 (10hashar) 
p:05Normal>03Lowest [10:12:40] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Rename zh-min-nan -> nan - https://phabricator.wikimedia.org/T30442#2636347 (10hashar) p:05Normal>03Lowest [10:12:42] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Renaming the Aramaic (arc) Wikipedia to the Syriac (syc) Wikipedia - https://phabricator.wikimedia.org/T28725#2636349 (10hashar) p:05Normal>03Lowest [10:12:44] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Rename zh-yue -> yue - https://phabricator.wikimedia.org/T30441#2636348 (10hashar) p:05Normal>03Lowest [10:12:46] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: language code change for Samogitian: "bat-smg" to "sgs" - https://phabricator.wikimedia.org/T27522#2636350 (10hashar) p:05Normal>03Lowest [10:12:48] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Move the Nourmande Wikipedia from nrm to nrf - https://phabricator.wikimedia.org/T25216#2636351 (10hashar) p:05Normal>03Lowest [10:12:50] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Rename the als.*.org projects to gsw.*.org - https://phabricator.wikimedia.org/T25215#2636352 (10hashar) p:05Normal>03Lowest [10:12:52] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Rename 'roa-rup' wikis to 'rup' - https://phabricator.wikimedia.org/T17988#2636353 (10hashar) p:05Low>03Lowest [10:14:21] mafk, please create such a ticket, the right people who will know more are not around at the moment [10:14:29] I am dropping those tasks from #operations , no need to get them in your umbrella [10:14:37] (under your umbrella) [10:15:12] mafk, sorry I wasn't more useful, but I will make sure it has the right attention [10:17:40] (03PS1) 10Giuseppe Lavagetto: Revert "role::puppetmaster::frontend: remove 'puppet' vhost" [puppet] - 10https://gerrit.wikimedia.org/r/310506 [10:23:01] (03CR) 10Hashar: "Indeed that is a noop for production." 
[puppet] - 10https://gerrit.wikimedia.org/r/308322 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [10:23:11] (03PS2) 10Giuseppe Lavagetto: Revert "role::puppetmaster::frontend: remove 'puppet' vhost" [puppet] - 10https://gerrit.wikimedia.org/r/310506 [10:25:48] (03CR) 10Hashar: "Apparently it is not applied anywhere on labs https://tools.wmflabs.org/watroles/role/role::ircyall So maybe drop it entirely? Yuvipand" [puppet] - 10https://gerrit.wikimedia.org/r/308311 (owner: 10Hashar) [10:25:53] !log varnish-be restarted on cp4005 [10:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:27:24] 06Operations, 07Puppet: Puppetize ircyall & set up instance appropriately - https://phabricator.wikimedia.org/T1357#23862 (10hashar) https://gerrit.wikimedia.org/r/#/c/308311/ is to move the puppet class role::ircyall under the role module. From https://tools.wmflabs.org/watroles/role/role::ircyall it seems t... [10:28:03] (03CR) 10Hashar: "And in Phabricator I found the old task T1357 from November 2014. Maybe it was just a one off experiment." 
[puppet] - 10https://gerrit.wikimedia.org/r/308311 (owner: 10Hashar) [10:29:54] (03PS1) 10Alexandros Kosiaris: puppetmaster: Add --quiet option to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/310515 [10:30:14] (03PS1) 10Giuseppe Lavagetto: Added fake secrets for puppetmasters [labs/private] - 10https://gerrit.wikimedia.org/r/310516 [10:30:31] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:30:31] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:30:34] RECOVERY - puppet last run on mw2088 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:30:49] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:30:50] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Added fake secrets for puppetmasters [labs/private] - 10https://gerrit.wikimedia.org/r/310516 (owner: 10Giuseppe Lavagetto) [10:30:55] hmm maybe another 30 mins without icinga-wm is ok [10:31:06] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:31:06] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:31:27] !log stopped temporarily ircecho (icinga-wm) on neon [10:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:33:25] (03PS24) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [10:36:40] jynus: filed and CCd you [10:37:01] I will subscribe tgr|away too [10:38:03] mafk, thank you, let me see [10:38:26] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "PCC says it's a noop on palladium, and does the expected changes on puppetmaster1001" [puppet] - 10https://gerrit.wikimedia.org/r/310506 (owner: 
10Giuseppe Lavagetto) [10:41:52] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2636424 (10Marostegui) Unfortunately the ILO isn't showing anything relevant hardware-wise between the crash and when we power cycled the server This is the first record from yesterday which is basically when we connected... [10:44:10] mafk, I am checking our own logs to try to provide more information [10:44:23] jynus: thank you very much [10:44:35] logstash may have something [10:44:42] but I don't have access to it [10:45:04] (03PS1) 10Filippo Giunchedi: standard: add prometheus node_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/310519 (https://phabricator.wikimedia.org/T140646) [10:45:57] (03PS25) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [10:46:57] (03PS1) 10Giuseppe Lavagetto: puppetmaster::web_frontend: use ssldir in the vhost too [puppet] - 10https://gerrit.wikimedia.org/r/310521 [10:47:51] (03Abandoned) 10Filippo Giunchedi: site: add prometheus node_exporter to more codfw machines [puppet] - 10https://gerrit.wikimedia.org/r/307929 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [10:50:17] 06Operations, 10DBA, 10MediaWiki-API, 10MediaWiki-Page-deletion: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2636434 (10MarcoAurelio) [10:50:34] <_joe_> is CI down? [10:50:49] (03CR) 10Jcrespo: [C: 031] "I am ok with this as a temporary measure. 
When rolled out to all servers, it should be moved to the mariadb module (or we can move it now, " [puppet] - 10https://gerrit.wikimedia.org/r/310498 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [10:53:05] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2636435 (10Marostegui) a:03Marostegui [10:53:13] (03PS2) 10Filippo Giunchedi: standard: add prometheus node_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/310519 (https://phabricator.wikimedia.org/T140646) [10:54:56] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "verified with pcc" [puppet] - 10https://gerrit.wikimedia.org/r/310521 (owner: 10Giuseppe Lavagetto) [10:56:32] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2636440 (10jcrespo) p:05Triage>03High [10:57:29] 06Operations, 10ops-eqiad, 10DBA: db1082 hardware check - https://phabricator.wikimedia.org/T145607#2636444 (10jcrespo) p:05Triage>03High I am going to put this high, because this blocks putting the server back into production, and that means it will lag for longer, so this is time sensitive. 
[11:01:31] (03PS1) 10Giuseppe Lavagetto: puppetmaster::web_frontend: fix ssldir [puppet] - 10https://gerrit.wikimedia.org/r/310522 [11:03:41] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster::web_frontend: fix ssldir [puppet] - 10https://gerrit.wikimedia.org/r/310522 (owner: 10Giuseppe Lavagetto) [11:06:16] looks like CI is working, afaict [11:07:23] <_joe_> yes, it's zuul, probably [11:07:36] (03CR) 10Filippo Giunchedi: "test run with PCC https://puppet-compiler.wmflabs.org/4070/" [puppet] - 10https://gerrit.wikimedia.org/r/310519 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [11:09:28] <_joe_> I might screw puppet up again [11:12:42] <_joe_> turns out I didn't [11:19:19] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet runs fails randomly on deployment-prep / beta cluster hosts - https://phabricator.wikimedia.org/T145631#2636458 (10hashar) [11:21:46] 06Operations, 10DBA, 10MediaWiki-API, 10MediaWiki-Page-deletion, 07Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2636471 (10jcrespo) @Anomie If you can give a look at this (I am myself a bit lost) and... [11:27:29] mafk, I found the error, it is certainly hitting a transaction limit [11:28:16] the right people will probably wake up a bit later, and I hope we can get a workaround or a fix soon [11:28:28] I hope you can wait some hours, and again, apologies [11:38:31] oh there is a transaction limit for the api? [11:39:03] I think the limit was reduced recently [11:39:12] I am assuming something like that [11:39:33] but as it is mediawiki timeout'ing, not mysql [11:40:05] I prefer the experts to chime in (it wouldn't be the first time I assume something wrongly for mediawiki) [11:40:19] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636484 (10Gilles) It looks like it's dying without an error. I.e. 
it might have been processing a request and everything was fine from thumbor's perspective. I'll turn the debug... [11:40:32] but this is clearly impacting users, so it should have high priority, I think [11:42:00] 06Operations, 06Performance-Team, 10Thumbor: Separate 404s into their own log - https://phabricator.wikimedia.org/T145632#2636486 (10Gilles) [11:44:40] (03CR) 10ArielGlenn: [C: 031] snapshot: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/308355 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [11:46:40] (03PS1) 10Giuseppe Lavagetto: Switch puppet for codfw and ulsfo to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/310527 [11:46:51] <_joe_> akosiaris: ^^ seems ok? [11:47:06] (03PS1) 10Muehlenhoff: Make /opt/wmf-mariadb10/bin/mysqld_safe managed by puppet [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310528 (https://phabricator.wikimedia.org/T145378) [11:47:19] (03CR) 10Alexandros Kosiaris: [C: 032] Switch puppet for codfw and ulsfo to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/310527 (owner: 10Giuseppe Lavagetto) [11:47:46] (03Abandoned) 10Muehlenhoff: Make /opt/wmf-mariadb10/bin/mysqld_safe managed by puppet [puppet] - 10https://gerrit.wikimedia.org/r/310498 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [11:47:56] _joe_: merged, let's see all hell breaking loose now [11:48:22] <_joe_> akosiaris: are you running authdns-update too? [11:48:28] ran it already [11:48:44] <_joe_> kamikaze :P [11:49:08] running puppet manually on a few hosts now [11:49:26] <_joe_> ok [11:49:52] hmm, gotta wait 5 mins [11:50:03] or purge the RR from recdns... 
both easy enough [11:50:06] <_joe_> heh, about to say that :) [11:50:07] but I 'll do the latter [11:52:33] being able to actually wipe specific records is so nice [11:52:41] damn precise had a bug and we could not [11:52:58] <_joe_> we should try a few precises too [11:53:05] <_joe_> we have none in codfw IIRC [11:53:05] _joe_: FYI, all looks good for now [11:53:18] only in eqiad [11:54:57] <_joe_> so let's try gallium, I'll add an entry to /etc/hosts [11:55:08] (03PS1) 10Muehlenhoff: Drop the malloc wrapper from mysqld_safe [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310529 (https://phabricator.wikimedia.org/T145378) [11:55:20] _joe_: all of ulsfo is fine [11:55:49] I am gonna see how the new infra actually holds ... [11:55:55] <_joe_> akosiaris: you already ran puppet on every node? [11:56:11] _joe_: yup [11:56:12] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636521 (10Gilles) I see that it seems to happen around 17 times/hour on average. Meaning 0.007% of requests would die that way. And if only thumb hits are considered, that's 0.12... 
[11:56:57] _joe_: and I 've actually forced a puppet run across all of codfw [11:57:08] _joe_: which normally would be idiotic [11:57:22] <_joe_> akosiaris: man you're gonna kill rhodium as well [11:57:22] but seems like the 3 servers can take it this time around [11:57:36] <_joe_> puppetmaster1001 is at full steam right now [11:57:41] <_joe_> 150 apache busy threads [11:58:00] not really [11:58:03] <_joe_> rhodium is dying [11:58:13] I mean, yeah 150 apache busy threads but puppetmaster1001 and 1002 are ok [11:58:15] <_joe_> cpu is maxed out, so is memory [11:58:31] <_joe_> we should remove it from the lb in puppetmaster1001 [11:58:44] it is not maxed out [11:58:52] it's on 12GB ram [11:58:54] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636525 (10jcrespo) I tried it, the only thing I took from here is that MariaDB -at least the version we use- is dumb, and I am 99% convinced this issue would b... [11:59:10] out of 16 and no cpu is above 40% constantly [11:59:24] <_joe_> yeah had a peak 1 minute ago [11:59:33] <_joe_> wow, it can really hold something like this? [11:59:38] if there is one thing limiting the impact is probably the 150 apache children [11:59:59] that being said [12:00:04] ?[mNotice: Finished catalog run in 236.27 seconds?[0m [12:00:18] so, it obviously has an impact on the agent times [12:00:26] <_joe_> that's very nice, though [12:00:38] but I am willing to bet it's the 150 max children... [12:00:43] which is awesome! [12:00:51] <_joe_> yes, and new servers have more memory too, right? [12:00:59] <_joe_> nope [12:01:00] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636529 (10jcrespo) a:05jcrespo>03None This is a one line patch, I can do it if you have faith in me; I personally don't. I would like to focus on the docum... 
[12:01:02] <_joe_> but still [12:01:04] no, they are on 16GB as well [12:01:30] (03PS1) 10Muehlenhoff: Don't source a custom mysql configuration from /srv/sqldata/my.conf [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310530 (https://phabricator.wikimedia.org/T145378) [12:02:10] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:02:17] <_joe_> akosiaris: btw rhodium was loaded by the rest of the infrastructure [12:02:20] <_joe_> uhm [12:02:43] expected [12:02:49] <_joe_> puppetmaster1001 has constant 150 busy threads even now [12:03:04] <_joe_> seems like you're going to get a lot of puppet failures [12:03:26] <_joe_> so the tl;dr is "have more than 150 apache threads on the frontend" [12:03:43] PROBLEM - puppetmaster backend https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:04] <_joe_> akosiaris: it seems like passenger died on puppetmaster1001 [12:04:11] <_joe_> all ruby processes are idling [12:04:56] <_joe_> they're blocked on a select() to a pipe that is how ruby communicates with mod_passenger [12:05:03] <_joe_> we'll need to restart apache there [12:05:05] sigh [12:05:31] well, ok.. at least we know it can more or less withstand that much traffic [12:05:36] <_joe_> !log restarting apache on puppetmaster1001 [12:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:05:47] not sure why it did though [12:05:50] died* [12:05:55] <_joe_> akosiaris: try -b 15% next time :) [12:06:03] where's the fun in that ? 
[12:06:11] I did actually think about it [12:06:13] RECOVERY - puppetmaster backend https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.969 second response time [12:06:16] anyway [12:06:34] <_joe_> I have to go to lunch in a few [12:06:35] we are going to get failures ofc [12:06:38] ok [12:06:39] <_joe_> yes [12:07:14] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 331 bytes in 0.541 second response time [12:07:42] <_joe_> it's maxing out again, FWIW [12:07:49] <_joe_> you damn cowboy :P [12:08:15] most agents are done [12:08:19] so we should be ok [12:08:36] yeah it's dropping down to normal levels now [12:09:15] <_joe_> yup [12:09:26] <_joe_> btw, look at the access log for the puppet vhost [12:09:38] <_joe_> and you'll see why managing files in people's homes is a bad idea [12:09:45] <_joe_> as I said the first time. [12:10:05] heh [12:10:27] <_joe_> if we didn't, it would be so much easier to support all of our infrastructure... [12:11:21] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 5 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_confd_lint],File[/usr/local/lib/nagios/plugins/check_confd_template],File[/home/midom],File[/home/yuvipanda] [12:11:21] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 5 minutes ago with 11 failures. Failed resources (up to 3 shown): File[/usr/local/bin/varnishmedia],File[/etc/logrotate.d/confd],File[/usr/local/bin/confd-lint-wrap],File[/usr/local/lib/nagios/plugins/check_confd_lint] [12:11:21] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 5 minutes ago with 9 failures. 
Failed resources (up to 3 shown): File[/etc/rsyslog.d/20-confd.conf],File[/usr/local/lib/nagios/plugins/check_confd_lint],File[/usr/local/lib/nagios/plugins/check_confd_template],File[/home/midom] [12:11:21] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 5 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_confd_lint],File[/usr/local/lib/nagios/plugins/check_confd_template],File[/home/midom],File[/home/yuvipanda] [12:11:29] PROBLEM - puppet last run on mc2011 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 5 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:11:30] PROBLEM - puppet last run on ganeti2001 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 5 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:11:30] PROBLEM - puppet last run on prometheus2002 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 5 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:11:30] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 5 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/aaron],File[/home/tstarling],File[/home/ema],File[/home/robh] [12:11:30] PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 5 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/eevans],File[/home/yuvipanda],File[/home/aaron] [12:11:30] PROBLEM - puppet last run on planet2001 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 5 minutes ago with 8 failures. 
Failed resources (up to 3 shown): File[/home/filippo],File[/home/oblivian],File[/home/jmm],File[/home/marostegui] [12:11:30] PROBLEM - puppet last run on wtp2001 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 5 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/eevans],File[/home/yuvipanda],File[/home/aaron] [12:11:31] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 5 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/eevans],File[/home/yuvipanda],File[/home/aaron] [12:11:31] PROBLEM - puppet last run on wtp2002 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 5 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/eevans],File[/home/yuvipanda],File[/home/aaron] [12:11:32] PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 5 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:11:32] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 5 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/etc/ssh/userkeys/root.d/labstore],File[/home/midom],File[/home/yuvipanda],File[/home/aaron] [12:11:33] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 5 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/aaron],File[/home/tstarling],File[/home/gwicke],File[/home/ema] [12:11:33] PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 5 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/awight],File[/home/marostegui],File[/home/niharika29],File[/home/joal] [12:11:34] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 5 minutes ago with 8 failures. 
Failed resources (up to 3 shown): File[/home/midom],File[/home/eevans],File[/home/yuvipanda],File[/home/aaron] [12:11:50] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 5 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/confd],File[/usr/local/bin/confd-lint-wrap],File[/usr/local/lib/nagios/plugins/check_confd_lint],File[/usr/local/lib/nagios/plugins/check_confd_template] [12:11:50] PROBLEM - puppet last run on db2057 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:11:50] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 5 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling],File[/home/ema] [12:11:50] PROBLEM - puppet last run on mc2013 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 5 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling],File[/home/ema] [12:11:50] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 5 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/eevans],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:11:50] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:11:50] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 6 minutes ago with 10 failures. 
Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:11:51] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 5 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling],File[/home/gwicke] [12:11:51] PROBLEM - puppet last run on maps2004 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 5 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_postgres_replication_lag.py],File[/etc/postgresql/9.4/main/slave.conf],File[/home/midom],File[/home/yurik] [12:11:52] PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 5 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/laner],File[/home/elukey],File[/home/jynus],File[/home/akosiaris] [12:11:52] PROBLEM - puppet last run on mc2012 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 6 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/aaron],File[/home/tstarling],File[/home/ema],File[/home/robh] [12:11:53] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/tstarling],File[/home/gwicke],File[/home/ema],File[/home/robh] [12:11:53] PROBLEM - puppet last run on labstore2004 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 5 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/etc/ssh/userkeys/root.d/labstore],File[/home/midom],File[/home/yuvipanda],File[/home/aaron] [12:11:54] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. 
Failed resources (up to 3 shown): File[/etc/ssh/userkeys/root.d/labstore],File[/home/midom],File[/home/yuvipanda],File[/home/aaron] [12:12:09] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 5 minutes ago with 17 failures. Failed resources (up to 3 shown): File[/usr/local/bin/phaste],File[/usr/local/lib/nagios/plugins/get-raid-status-hpssacli],File[/usr/local/lib/nagios/plugins/check_raid],File[/usr/local/lib/nagios/plugins/check_puppetrun] [12:12:21] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 6 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/usr/local/bin/confd-lint-wrap],File[/etc/rsyslog.d/20-confd.conf],File[/usr/local/lib/nagios/plugins/check_confd_lint],File[/usr/local/lib/nagios/plugins/check_confd_template] [12:12:21] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 6 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:12:21] PROBLEM - puppet last run on db2070 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 6 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/laner],File[/home/elukey],File[/home/jynus],File[/home/akosiaris] [12:12:21] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:12:21] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 6 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/ema],File[/home/robh],File[/home/jgreen],File[/home/laner] [12:12:21] PROBLEM - puppet last run on ganeti2003 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. 
Failed resources (up to 3 shown): File[/home/jgreen],File[/home/laner],File[/home/elukey],File[/home/jynus] [12:12:22] PROBLEM - puppet last run on es2019 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:12:22] PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling],File[/home/ema] [12:12:23] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/eevans],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:12:23] PROBLEM - puppet last run on restbase2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh] [12:12:24] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 6 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:12:24] PROBLEM - puppet last run on auth2001 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 6 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling],File[/home/ema] [12:12:25] PROBLEM - puppet last run on rdb2003 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/yuvipanda],File[/home/aaron],File[/home/tstarling] [12:12:25] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. 
Failed resources (up to 3 shown): File[/home/mholloway-shell],File[/home/midom],File[/home/yurik],File[/home/eevans] [12:12:26] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/home/midom],File[/home/eevans],File[/home/yuvipanda],File[/home/aaron] [12:12:26] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Puppet has 14 failures. Last run 5 minutes ago with 14 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py],File[/usr/local/bin/puppet-enabled],File[/usr/lib/nagios/plugins/check_sysctl],File[/etc/sysctl.d] [12:12:47] !log stop ircecho (icinga-wm) temporarily on neon [12:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:12:56] all this ^ is known and expected btw! [12:13:52] (03CR) 10Volans: "a couple of questions inline and a typo" (032 comments) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310528 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [12:21:49] (03CR) 10Muehlenhoff: Make /opt/wmf-mariadb10/bin/mysqld_safe managed by puppet (032 comments) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310528 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [12:23:23] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636574 (10Gilles) First instance of it seems to be while trying to create a huge PDF thumbnail: ``` Sep 14 12:01:56 thumbor1001 thumbor@8827[23855]: [2016-09-14 12:01:56,289 -... 
[12:25:55] (03CR) 10Jcrespo: Make /opt/wmf-mariadb10/bin/mysqld_safe managed by puppet (031 comment) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310528 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [12:33:54] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636602 (10Gilles) Second example, this time it dies trying to process at 17MB PNG extracted from a PDF: ``` Sep 14 12:23:28 thumbor1001 thumbor@8824[23870]: [2016-09-14 12:23:2... [12:35:39] hashar: still no patches for eu swat today, assuming it will not happen [12:36:20] 06Operations, 06Performance-Team, 10Thumbor: Use intermediary high-quality JPEGs rather than PNGs for PDF thumbnailing - https://phabricator.wikimedia.org/T145637#2636604 (10Gilles) [12:41:29] zeljkof hashar, there might be 1 patch for eu swat today (www.wikipedia.org portal) [12:41:36] \O/ [12:42:11] (03PS4) 10Mobrovac: Conftool: Create script that checks the state after (de)pooling [puppet] - 10https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) [12:44:30] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636620 (10Gilles) Actually I now realize that the first one was also a PDF, and died for the same reason, it just got slightly further in the IM logging output before it died. [12:45:33] (03PS2) 10Giuseppe Lavagetto: puppetmaster: switch ca_server to be puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/310274 [12:49:05] PROBLEM - puppet last run on elastic2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main] [12:49:11] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636625 (10Gilles) I've found a different kind, this time it's VIPS dying because we ask it for a giant PNG made from a giant TIFF: ``` Sep 14 12:39:15 thumbor1001 thumbor@8834[7... [12:49:29] (03PS1) 10Jdrewniak: Bumping portals to master Stats update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310537 (https://phabricator.wikimedia.org/T128546) [12:52:02] 06Operations, 06Performance-Team, 10Thumbor: VIPS engine should generate JPG when dealing with TIFFs and not have the IM engine read it - https://phabricator.wikimedia.org/T145638#2636630 (10Gilles) [12:53:21] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf] [12:53:41] hashar: ok, looks like there is one commit for eu swat [12:53:47] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: switch ca_server to be puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/310274 (owner: 10Giuseppe Lavagetto) [12:54:01] what's the plan? irc? hangout? pairing? solo? 
[12:55:14] (03PS2) 10Muehlenhoff: Make /opt/wmf-mariadb10/bin/mysqld_safe managed by puppet [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310528 (https://phabricator.wikimedia.org/T145378) [12:57:54] (03PS1) 10Giuseppe Lavagetto: Revert "puppetmaster: switch ca_server to be puppetmaster1001" [puppet] - 10https://gerrit.wikimedia.org/r/310538 [12:58:01] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Revert "puppetmaster: switch ca_server to be puppetmaster1001" [puppet] - 10https://gerrit.wikimedia.org/r/310538 (owner: 10Giuseppe Lavagetto) [12:58:59] <_joe_> ouch [12:59:05] <_joe_> puppet failures will arrive [12:59:23] zeljkof: yeah for jan_drewniak and the portals [13:00:01] :( [13:00:04] hashar, Dereckson, addshore, and aude: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160914T1300). Please do the needful. [13:00:04] jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:01:16] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636658 (10Gilles) Another find was an OOM on a large JPG (45MB) with a target width of 0. Which will be caught early as an invalid request thanks to T145614 in the future. 
[13:01:29] (03PS1) 10Filippo Giunchedi: thumbor: increase icinga retries for service units [puppet] - 10https://gerrit.wikimedia.org/r/310539 (https://phabricator.wikimedia.org/T145623) [13:01:38] (03CR) 10Jcrespo: "root@db1082:~$ diff ~/mysqld_safe /opt/wmf-mariadb10/bin/mysqld_safe" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310528 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [13:02:19] oh sync-portals [13:02:45] (03CR) 10Hashar: [C: 032] Bumping portals to master Stats update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310537 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:03:15] (03Merged) 10jenkins-bot: Bumping portals to master Stats update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310537 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:04:13] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636663 (10Gilles) Trying to resize a 53MB JPG OOMs in a predictable fashion, and also fails in production: https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/1857_Bird%27s... [13:04:20] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:04:20] PROBLEM - puppet last run on planet2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:04:21] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:04:21] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Puppet has 12 failures. Last run 6 minutes ago with 12 failures. Failed resources (up to 3 shown) [13:04:22] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 15 failures. Last run 6 minutes ago with 15 failures. 
Failed resources (up to 3 shown) [13:04:22] PROBLEM - puppet last run on mw2138 is CRITICAL: CRITICAL: Puppet has 30 failures. Last run 5 minutes ago with 30 failures. Failed resources (up to 3 shown) [13:04:22] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:04:22] PROBLEM - puppet last run on wtp2003 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 5 minutes ago with 5 failures. Failed resources (up to 3 shown) [13:04:40] PROBLEM - puppet last run on mc2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:04:40] PROBLEM - puppet last run on maps2004 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. Failed resources (up to 3 shown) [13:04:40] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:04:40] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 5 minutes ago with 10 failures. Failed resources (up to 3 shown) [13:04:41] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 5 minutes ago with 11 failures. Failed resources (up to 3 shown) [13:04:42] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:04:43] (puppet errors as expected solved by https://gerrit.wikimedia.org/r/310538 ) [13:04:50] (03PS1) 10BBlack: LRU_Fail debugging [debs/varnish4] - 10https://gerrit.wikimedia.org/r/310540 [13:04:52] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:04:52] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. 
Failed resources (up to 3 shown) [13:05:01] hashar: what's the plan for swat? you? me? pairing? [13:05:06] doing it [13:05:07] (03PS1) 10Giuseppe Lavagetto: Revert "Revert "puppetmaster: switch ca_server to be puppetmaster1001"" [puppet] - 10https://gerrit.wikimedia.org/r/310541 [13:05:11] PROBLEM - puppet last run on elastic2010 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 6 minutes ago with 7 failures. Failed resources (up to 3 shown) [13:05:11] PROBLEM - puppet last run on db2062 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 7 minutes ago with 10 failures. Failed resources (up to 3 shown) [13:05:12] rebasing portals on tin [13:05:12] PROBLEM - puppet last run on mw2162 is CRITICAL: CRITICAL: Puppet has 16 failures. Last run 5 minutes ago with 16 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_sysctl],File[/usr/local/sbin/hhvm_cleanup_cache],File[/usr/local/bin/apache-status],File[/usr/lib/nagios/plugins/check_conntrack] [13:05:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Revert "Revert "puppetmaster: switch ca_server to be puppetmaster1001"" [puppet] - 10https://gerrit.wikimedia.org/r/310541 (owner: 10Giuseppe Lavagetto) [13:05:20] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 7 minutes ago with 8 failures. Failed resources (up to 3 shown) [13:05:20] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: increase icinga retries for service units [puppet] - 10https://gerrit.wikimedia.org/r/310539 (https://phabricator.wikimedia.org/T145623) (owner: 10Filippo Giunchedi) [13:05:21] PROBLEM - puppet last run on mw2088 is CRITICAL: CRITICAL: Puppet has 22 failures. Last run 6 minutes ago with 22 failures. 
Failed resources (up to 3 shown) [13:05:28] (03PS2) 10Filippo Giunchedi: thumbor: increase icinga retries for service units [puppet] - 10https://gerrit.wikimedia.org/r/310539 (https://phabricator.wikimedia.org/T145623) [13:05:30] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 13 failures. Last run 6 minutes ago with 13 failures. Failed resources (up to 3 shown): File[/lib/systemd/system/traffic-pool.service],File[/usr/local/bin/nrpe_check_systemd_unit_state],File[/etc/logrotate.d/puppet],File[/etc/systemd/system/nginx.service.d/security.conf] [13:05:31] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 6 minutes ago with 17 failures. Failed resources (up to 3 shown) [13:05:43] oh [13:06:12] PROBLEM - puppet last run on elastic2024 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 7 minutes ago with 7 failures. Failed resources (up to 3 shown) [13:06:13] hashar: ok, I'll go to lunch then, ping me if you need help :) [13:06:13] PROBLEM - puppet last run on sca2001 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 6 minutes ago with 5 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/blacklist-linux44.conf],File[/usr/local/share/ca-certificates/DigiCert_SHA2_High_Assurance_Server_CA.crt],File[/etc/apparmor.d/abstractions/ssl_certs],File[/usr/local/sbin/grain-ensure] [13:06:13] PROBLEM - puppet last run on restbase2008 is CRITICAL: CRITICAL: Puppet has 13 failures. Last run 8 minutes ago with 13 failures. Failed resources (up to 3 shown) [13:06:13] PROBLEM - puppet last run on suhail is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 8 minutes ago with 8 failures. Failed resources (up to 3 shown) [13:06:13] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 7 minutes ago with 7 failures. 
Failed resources (up to 3 shown) [13:06:20] zeljkof: should be good :} [13:06:24] jan_drewniak: pulled on mw1099 [13:06:30] jan_drewniak: afaik it is uncached [13:06:52] <_joe_> akosiaris: can you silence damn icinga-wm? [13:06:55] <_joe_> :) [13:07:06] PROBLEM - puppet last run on mw2177 is CRITICAL: CRITICAL: Puppet has 24 failures. Last run 7 minutes ago with 24 failures. Failed resources (up to 3 shown): File[/etc/apache2/mods-available/setenvif.conf],File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py],File[/etc/apache2/mods-available/userdir.conf],File[/usr/lib/nagios/plugins/check_sysctl] [13:07:15] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 7 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/etc/apache2/mods-available/userdir.conf],File[/usr/local/sbin/puppet-run],File[/root/.screenrc],File[/usr/local/sbin/enforce-users-groups] [13:07:15] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 7 minutes ago with 17 failures. Failed resources (up to 3 shown): File[/etc/apache2/mods-available/userdir.conf],File[/etc/ImageMagick-6/policy.xml],File[/etc/ferm/conf.d/00_main],File[/usr/local/sbin/enforce-users-groups] [13:07:15] PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 7 minutes ago with 21 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_sysctl],File[/etc/profile.d/field.sh],File[/etc/ferm/functions.conf],File[/usr/local/bin/apache-status] [13:07:26] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [13:07:47] (03CR) 10Jcrespo: [C: 031] "I am nitpicking, but can we name the class and the file equally? 
Maybe it is my java roots speaking, but *I know* I will make a mistake; I" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310528 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [13:07:53] !log stop ircecho (icinga-wm) temporarily on neon [13:07:57] _joe_: done [13:08:26] <_joe_> akosiaris: and puppetmaster1001 is now the ca_server [13:08:34] jan_drewniak: does it works fine on mw1099 ? [13:08:37] \o/ [13:08:52] hashar: on mw1099 it looks good [13:08:59] will sync [13:09:40] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor cgroup OOM - https://phabricator.wikimedia.org/T145623#2636666 (10Gilles) I'll turn DEBUG logging back off. It seems like most, if not all OOMs are legit, and that I've found the areas I can improve. I.e. avoiding intermediary PNGs: T... [13:09:43] <_joe_> akosiaris: only palladium is still not syncing from it [13:10:21] ok [13:10:31] !log hashar@tin Synchronized portals/prod/wikipedia.org/assets: Bumping portals to master T128546 (duration: 00m 48s) [13:10:32] T128546: [Recurring Task] Update Wikipedia.org Portal and sister Wiki's statistics - https://phabricator.wikimedia.org/T128546 [13:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:11:18] !log hashar@tin Synchronized portals: Bumping portals to master T128546 (duration: 00m 47s) [13:11:20] _joe_: I am going to switch over esams [13:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:11:29] <_joe_> akosiaris: ok [13:11:32] jan_drewniak: it purged a single url [13:11:48] https://www.wikipedia.org/ should be up-to-date [13:12:08] next time we should make you a deployer and train you to handle the deployment :} [13:12:53] hashar: yup it looks good. 
I'd be up for that :) [13:13:33] (03PS1) 10Alexandros Kosiaris: puppet: Switch over esams to new infra [dns] - 10https://gerrit.wikimedia.org/r/310542 [13:14:01] (03CR) 10Alexandros Kosiaris: [C: 032] puppet: Switch over esams to new infra [dns] - 10https://gerrit.wikimedia.org/r/310542 (owner: 10Alexandros Kosiaris) [13:16:58] (03PS3) 10Muehlenhoff: Make /opt/wmf-mariadb10/bin/mysqld_safe managed by puppet [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310528 (https://phabricator.wikimedia.org/T145378) [13:17:09] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404#2636678 (10Gehel) elasticsearch 2.3.5 has been uploaded to reprepro (our apt repository). [13:20:12] jan_drewniak: lets do it next time you need a bump :} [13:20:38] !log renaming tables in s3 codfw - T132837 [13:20:39] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837 [13:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:10] hashar: sweet! [13:25:24] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636683 (10mobrovac) Since ApiQueryBacklinks is using a straight join, it might be safe enough to do it for templatelinks as well? [13:30:37] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636689 (10jcrespo) >>! In T145079#2636683, @mobrovac wrote: > Since ApiQueryBacklinks is using a straight join, it might be safe enough to do it for templateli... 
[13:30:59] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:31:00] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:31:09] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:31:09] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:31:10] RECOVERY - puppet last run on wtp2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:31:19] RECOVERY - puppet last run on mc2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:31:19] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:31:20] RECOVERY - puppet last run on maps2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:31:20] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:31:28] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:31:29] RECOVERY - puppet last run on mw2181 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:32:00] RECOVERY - puppet last run on mw2162 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:33:39] !log upgrading logstash to elasticsearch 2.3.5 - T145404 [13:33:39] T145404: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404 [13:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:21] (03PS1) 10Jcrespo: Make sure there is only one my.cnf, owned by root [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310545 (https://phabricator.wikimedia.org/T145375) [13:39:27] (03CR) 
10Muehlenhoff: [C: 031] "Looks good" (031 comment) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310545 (https://phabricator.wikimedia.org/T145375) (owner: 10Jcrespo) [13:41:37] (03CR) 10Jcrespo: Make sure there is only one my.cnf, owned by root (031 comment) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310545 (https://phabricator.wikimedia.org/T145375) (owner: 10Jcrespo) [13:42:37] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:44:39] 06Operations, 10Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2636711 (10Dzahn) [13:47:50] (03CR) 10Marostegui: [C: 031] Make sure there is only one my.cnf, owned by root [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310545 (https://phabricator.wikimedia.org/T145375) (owner: 10Jcrespo) [13:49:03] (03CR) 10Marostegui: [C: 031] Make /opt/wmf-mariadb10/bin/mysqld_safe managed by puppet [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310528 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [13:50:28] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2636714 (10grin) >>! In T144508#2625975, @Ryan_Lane wrote: > I spent years working on a project to make this possible. Before I left we had an en... [13:52:16] 06Operations, 10DBA, 10MediaWiki-Page-deletion, 07Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2636715 (10Anomie) Removing #MediaWiki-API, since this has nothing to do with the API itself. >>! In T145... 
[14:03:06] (03PS1) 10Filippo Giunchedi: [WIP] swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) [14:05:44] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636772 (10Anomie) a:03Anomie Thanks for looking into alternative solutions, @jcrespo. I'll write the patch for straight_join. [14:08:14] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:08:39] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2636786 (10jcrespo) I am sorry, Anomie, for this and all other issues. We are already looking at next versions of the server that will fix this and other issues... [14:09:29] 06Operations, 10DBA, 10MediaWiki-Page-deletion, 07Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2636787 (10MarcoAurelio) Is there any way via API sandbox (maxlag / maxage / smaxage) which can be used to... 
[14:11:23] 06Operations, 10MediaWiki-Cache, 10Traffic: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2636791 (10hashar) [14:11:57] (03CR) 10Jcrespo: [C: 031] Make /opt/wmf-mariadb10/bin/mysqld_safe managed by puppet [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310528 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [14:12:24] (03PS1) 10Muehlenhoff: Also enable systemd for keyholder-proxy [puppet] - 10https://gerrit.wikimedia.org/r/310550 (https://phabricator.wikimedia.org/T144043) [14:13:08] moritzm, I can deploy 310528 now and test it on db1082, which is depooled [14:13:19] because marostegui was about to start it [14:13:38] No rush from my side :) [14:14:18] well, first it is on the submodule, and it does not apply to any server, and it only applies on restart [14:14:34] so I think it is a very secure patch to do :-) [14:14:41] (03PS1) 10Ema: cache_upload: do not set do_stream=true on Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/310551 (https://phabricator.wikimedia.org/T131502) [14:14:52] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/DigiCert_High_Assurance_CA-3.crt] [14:17:00] jynus: sure, please go ahead. it's not yet enabled in any of the role classes, though? [14:17:07] no [14:17:23] and it requires an operations/puppet deploy anyway [14:17:38] so you want to apply it only to db1082 for testing?
sounds good to me [14:17:47] I will merge that first patch and mine to o/p/m [14:17:56] then to o/p [14:18:08] then we can create the role or whatever [14:18:25] ok [14:18:28] I will be doing the first 2 parts now [14:19:08] ok [14:19:39] (03CR) 10Jcrespo: [C: 032] Make /opt/wmf-mariadb10/bin/mysqld_safe managed by puppet [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310528 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [14:20:24] (03CR) 10Jcrespo: [C: 032] Make sure there is only one my.cnf, owned by root [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310545 (https://phabricator.wikimedia.org/T145375) (owner: 10Jcrespo) [14:20:28] (03PS2) 10Jcrespo: Make sure there is only one my.cnf, owned by root [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310545 (https://phabricator.wikimedia.org/T145375) [14:21:07] 06Operations, 10ops-eqiad, 10DBA: db1082 hardware check - https://phabricator.wikimedia.org/T145607#2636820 (10Marostegui) Kernel has been upgraded to 4.4.0-2 and a full-upgrade has been performed as well. [14:21:28] moritzm, could there be issues with package + puppet interaction, like there was with mediawiki? [14:21:38] *mediawiki servers [14:22:07] or that is so unlikely until we have proper packages we do not need to care? [14:22:22] you mean hhvm? 
no, in that case we had an ongoing divergent version, while in this case we'll also make the same changes in the next wmf-mariadb10 release [14:22:57] good [14:23:46] (03CR) 10Muehlenhoff: [C: 032] Also enable systemd for keyholder-proxy [puppet] - 10https://gerrit.wikimedia.org/r/310550 (https://phabricator.wikimedia.org/T144043) (owner: 10Muehlenhoff) [14:23:48] (03CR) 10Jcrespo: [V: 032] Make sure there is only one my.cnf, owned by root [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310545 (https://phabricator.wikimedia.org/T145375) (owner: 10Jcrespo) [14:26:04] (03PS2) 10Ema: cache_upload: do not set do_stream=true on Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/310551 (https://phabricator.wikimedia.org/T131502) [14:26:04] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404#2636852 (10Gehel) [14:26:20] !log upgrading elasticsearch codfw to elasticsearch 2.3.5 - T145404 [14:26:21] T145404: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404 [14:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:05] (03CR) 10Ema: [C: 032 V: 032] cache_upload: do not set do_stream=true on Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/310551 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [14:27:14] (03PS1) 10Jcrespo: mariadb: Add possibility of custom mysqld_safe & drop extra config [puppet] - 10https://gerrit.wikimedia.org/r/310553 [14:27:30] marostegui, check how fun submodules are^ [14:27:40] XDDD [14:27:59] do you +1 that change? [14:28:07] it is so clear!
[14:28:21] (03CR) 10Marostegui: [C: 031] mariadb: Add possibility of custom mysqld_safe & drop extra config [puppet] - 10https://gerrit.wikimedia.org/r/310553 (owner: 10Jcrespo) [14:28:44] (03CR) 10Jcrespo: [C: 032] mariadb: Add possibility of custom mysqld_safe & drop extra config [puppet] - 10https://gerrit.wikimedia.org/r/310553 (owner: 10Jcrespo) [14:28:50] (03PS2) 10Jcrespo: mariadb: Add possibility of custom mysqld_safe & drop extra config [puppet] - 10https://gerrit.wikimedia.org/r/310553 [14:28:59] It will take a bit longer to convince me :p [14:30:15] !log increasing delayed allocation to 10m on elasticsearch codfw to speed up cluster restart - T145404 [14:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:30:36] waiting for ema to deploy [14:32:15] it will be more fun when I have to revert the change if it fails [14:32:41] 06Operations, 10ops-codfw: mw2202/mw2203 failed to install - https://phabricator.wikimedia.org/T144911#2636879 (10Papaul) @MoritzMuehlenhoff Do we still have to work on this? [14:38:22] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [14:38:43] !log change-prop deploying ddc091e [14:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:02] (03PS1) 10Giuseppe Lavagetto: Switch most references away from palladium [dns] - 10https://gerrit.wikimedia.org/r/310556 [14:40:55] PROBLEM - puppet last run on elastic2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:42:48] 06Operations, 10DBA, 10MediaWiki-Page-deletion, 07Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2636899 (10jcrespo) @MarcoAurelio Translated, this means that the change now it is safe to be done normally... 
[14:42:55] 06Operations, 10ops-codfw: mw2202/mw2203 failed to install - https://phabricator.wikimedia.org/T144911#2636901 (10MoritzMuehlenhoff) 05Open>03Resolved The installation of these and other servers failed since the kernel doesn't detect the disks within the default time span. This also happened on mw2205-mw22... [14:43:22] RECOVERY - puppet last run on elastic2012 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [14:53:08] (03PS1) 10Filippo Giunchedi: [WIP] prometheus: add varnish_exporter [puppet] - 10https://gerrit.wikimedia.org/r/310557 [14:56:29] 06Operations, 07HHVM: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2636931 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [15:01:38] (03CR) 10Giuseppe Lavagetto: [C: 032] Switch most references away from palladium [dns] - 10https://gerrit.wikimedia.org/r/310556 (owner: 10Giuseppe Lavagetto) [15:07:53] 06Operations, 10Traffic, 13Patch-For-Review: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#1702350 (10Dzahn) @bblack This looks like it's resolved since those merges? [15:08:52] 06Operations, 10Traffic, 13Patch-For-Review: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#2636998 (10BBlack) 05Open>03Resolved a:03BBlack Well, there's long term stuff we can do better, but I guess no solid reason to keep this open in the task sense. [15:11:23] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2633433 (10Cmjohnson) I do not see anything with the server that we could pinpoint to a h/w issue. 
[15:18:04] (03PS1) 10Jcrespo: mariadb: control db1082's mysqld_safe file with puppet [puppet] - 10https://gerrit.wikimedia.org/r/310564 (https://phabricator.wikimedia.org/T145378) [15:18:47] 06Operations, 07HHVM: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2637063 (10MoritzMuehlenhoff) I've added a new deployment server mira02 based on jessie to deployment-prep. The arming of the keyholder went fine, there's a bug in "keyholder status" which uses the... [15:20:52] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:21:23] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:21:33] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:22:44] PROBLEM - puppet last run on analytics1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:23:35] 06Operations, 10LDAP-Access-Requests, 06TCB-Team, 06WMDE-Analytics-Engineering, and 2 others: Update wmde LDAP group - https://phabricator.wikimedia.org/T145384#2628285 (10ArielGlenn) Hey folks, while I trust @Tobi_WMDE_SW, we probably need an official ok from either a manager or someone in HR for the reco... [15:24:13] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:24:58] urandom: ready if you are [15:25:06] (03PS1) 10Ema: r::c::instances: split runtime_params in frontend and backend [puppet] - 10https://gerrit.wikimedia.org/r/310569 [15:25:22] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: reclaim or decom: cp1043 + cp1044 - https://phabricator.wikimedia.org/T133614#2637118 (10Cmjohnson) 05Open>03Resolved server removed from rack, ssd removed. 
added to decom tracking sheet, racktables updated. [15:25:24] (03PS9) 10Elukey: Simplification of Cassandra Logstash filtering [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [15:25:31] (03PS1) 10Jcrespo: Change mysqld_safe to be a module, not a role [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310570 (https://phabricator.wikimedia.org/T145378) [15:25:42] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:25:45] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:25:52] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:25:54] elukey: how do you want to do this? [15:25:55] 06Operations, 10LDAP-Access-Requests, 06TCB-Team, 06WMDE-Analytics-Engineering, and 2 others: Update wmde LDAP group - https://phabricator.wikimedia.org/T145384#2637124 (10Tobi_WMDE_SW) @ArielGlenn I think that would be @abraham. I'll poke him via email so he can approve here. [15:26:04] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:08] elukey: disable puppet everywhere? [15:26:21] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:37] urandom: yeah this would be great [15:26:43] <_joe_> uh what's up? [15:26:48] lol [15:27:22] <_joe_> akosiaris: the damn firewall rules for analytics and labs [15:27:32] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:27:37] elukey: so, an alternative, would be to just push it out, since it only goes into effect on a restart [15:27:43] <_joe_> every analytics host will do this [15:27:54] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:28:20] elukey: push it out, bounce the restbase staging nodes, eval, etc [15:28:24] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:28:32] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:28:36] urandom: right, so we can merge and then restart one by one, good [15:28:47] elukey: worst-case, you restart a node and it's so badly broken it won't startup [15:28:47] (03PS2) 10Filippo Giunchedi: [WIP] prometheus: add varnish_exporter [puppet] - 10https://gerrit.wikimedia.org/r/310557 [15:28:50] or maybe restart a few [15:28:51] then you rollback [15:28:58] yep makes sense [15:29:00] second worst-case, it breaks logging in some what [15:29:01] way [15:29:02] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:12] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:24] elukey: so yeah, if you want to merge, I will force a puppet run and restart on restbase staging [15:30:22] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:54] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:31:13] urandom: ok I wanted to wait for all these puppet fails to resolve but as _joe_ mentioned they are firewall related, so safe to proceed [15:31:16] merging [15:31:23] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:31:37] (03CR) 10Jcrespo: [C: 032] Change mysqld_safe to be a module, not a role [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310570 (https://phabricator.wikimedia.org/T145378) (owner: 10Jcrespo) [15:31:42] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:32:12] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:32:22] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:32:25] (03PS2) 10Jcrespo: mariadb: control db1082's mysqld_safe file with puppet [puppet] - 10https://gerrit.wikimedia.org/r/310564 (https://phabricator.wikimedia.org/T145378) [15:32:34] (03CR) 10Elukey: [C: 032] "Had a chat with Eric over IRC. He explained to me the change and it LGTM. The change will have effects only when cassandra will be restart" [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [15:32:52] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:33:03] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:33:12] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:33:22] <_joe_> all of analytics and labs will fail, we're working on it [15:33:27] <_joe_> it's just silly firewalls [15:33:29] sure sure [15:33:30] :) [15:33:40] elukey: we ready? [15:33:44] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:33:46] do we need to avoid palladium from now on _joe_ ? [15:33:51] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:34:03] (for the puppet-merge) [15:34:05] (03PS3) 10Filippo Giunchedi: [WIP] prometheus: add varnish_exporter [puppet] - 10https://gerrit.wikimedia.org/r/310557 [15:34:20] urandom: merged, will run puppet-merge in a bit [15:34:21] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:34:46] elukey: k, fyi: i have a meeting in 25mins [15:34:50] (last 30mins) [15:34:54] lasts [15:35:17] sure sure [15:35:45] (03CR) 10Jcrespo: [C: 031] "https://puppet-compiler.wmflabs.org/4077/" [puppet] - 10https://gerrit.wikimedia.org/r/310564 (https://phabricator.wikimedia.org/T145378) (owner: 10Jcrespo) [15:35:52] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:36:03] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:36:33] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:37:04] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:37:10] (03PS1) 10Alexandros Kosiaris: palladium: Set up a motd warning [puppet] - 10https://gerrit.wikimedia.org/r/310573 [15:37:12] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:37:42] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:37:51] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:38:15] (03CR) 10jenkins-bot: [V: 04-1] palladium: Set up a motd warning [puppet] - 10https://gerrit.wikimedia.org/r/310573 (owner: 10Alexandros Kosiaris) [15:38:34] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:38:45] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:38:47] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:38:47] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:38:47] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:39:17] (03PS1) 10Cmjohnson: Removing mgmt dns entries for dickson for decom [dns] - 10https://gerrit.wikimedia.org/r/310574 [15:39:52] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:40:23] urandom: merged [15:40:28] you are free to test [15:40:32] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:40:33] elukey: on it. [15:41:12] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:41:31] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:41:35] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for dickson for decom [dns] - 10https://gerrit.wikimedia.org/r/310574 (owner: 10Cmjohnson) [15:41:39] !log T130861: Forcing puppet run in restbase staging [15:41:40] T130861: Investigate and implement possible simplification of Cassandra Logstash filtering - https://phabricator.wikimedia.org/T130861 [15:41:42] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:02] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:20] mmm kafka1018? [15:42:23] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:33] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:33] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:50] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:55] <_joe_> elukey: it's in analytics? [15:43:02] <_joe_> anyways, no function is broken [15:43:12] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:43:16] 06Operations, 10hardware-requests: decommission dickson - https://phabricator.wikimedia.org/T120752#2637165 (10Cmjohnson) [15:43:21] 06Operations, 10hardware-requests: decommission dickson - https://phabricator.wikimedia.org/T120752#1860514 (10Cmjohnson) 05Open>03Resolved [15:43:25] yes yes I was just curious, theoretically it is since it communicates with hadoop [15:43:37] <_joe_> it is [15:44:00] 06Operations, 10ops-eqiad: labsdb1001: Swap eth0 cable - https://phabricator.wikimedia.org/T137555#2637169 (10Cmjohnson) 05Open>03Resolved resolving this since it's not the cable. [15:45:11] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:19] !log T130861: Restarting Cassandra, xenon.eqiad.wmnet [15:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:31] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:31] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:42] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:32] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:46:42] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:51] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:13] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:47:13] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:59] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:09] urandom: all good [15:48:10] ? [15:48:13] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:26] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:42] elukey: https://logstash.wikimedia.org/app/kibana#/dashboard/cassandra-eqiad?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-1h,mode:quick,to:now))&_a=(filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:_type,negate:!f,value:cassandra),query:(match:(_type:(query:cassandra,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:cluster,negat [15:48:49] whoa [15:48:52] PROBLEM - puppet last run on kafka1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:54] that is one helluva link [15:49:04] elukey: https://logstash.wikimedia.org/goto/ed4a071de4391ba4b44942029dad9818 [15:49:06] lol [15:49:12] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:49:16] elukey: but yeah, it looks good to me [15:49:51] good :) [15:49:52] !log T130861: Performing rolling Cassandra restart, restbase staging [15:49:53] T130861: Investigate and implement possible simplification of Cassandra Logstash filtering - https://phabricator.wikimedia.org/T130861 [15:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:50:28] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:50:31] (03CR) 10Marostegui: [C: 031] mariadb: control db1082's mysqld_safe file with puppet [puppet] - 10https://gerrit.wikimedia.org/r/310564 (https://phabricator.wikimedia.org/T145378) (owner: 10Jcrespo) [15:50:56] !log Ran T132839-Workarounds.sh from my home in terbium (see T132839) [15:50:58] T132839: Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [15:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:51:12] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:51:12] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:51:22] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:51:22] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:51:24] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:51:32] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:51:33] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last 
run 56 seconds ago with 0 failures [15:51:39] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:51:52] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:52:02] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:52:02] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:52:11] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:52:13] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:22] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:52:25] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:26] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:52:32] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:42] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:11] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:53:22] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:53:23] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:53:35] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:42] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:54:10] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:54:23] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:54:47] !log updated cr1-eqiad,cr2-eqiad puppet rules [15:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:05] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:32] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:55:50] elukey: LGTM, they're logging what I would expect, and not what I wouldn't (StatusLogger) [15:55:51] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:56:01] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 
31 seconds ago with 0 failures [15:56:03] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042 [15:56:28] urandom: gooood! Going to restart on aqs node [15:56:33] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:56:34] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "Overriding jenkins's -1. This is temporary anyway" [puppet] - 10https://gerrit.wikimedia.org/r/310573 (owner: 10Alexandros Kosiaris) [15:56:39] (03PS2) 10Alexandros Kosiaris: palladium: Set up a motd warning [puppet] - 10https://gerrit.wikimedia.org/r/310573 [15:56:41] (03CR) 10Alexandros Kosiaris: [V: 032] palladium: Set up a motd warning [puppet] - 10https://gerrit.wikimedia.org/r/310573 (owner: 10Alexandros Kosiaris) [15:56:44] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:56:59] !log restarting cassandra on aqs1001 T130861 [15:57:00] T130861: Investigate and implement possible simplification of Cassandra Logstash filtering - https://phabricator.wikimedia.org/T130861 [15:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:45] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:58:34] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:59:32] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:59:53] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:42] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:00:48] 06Operations, 10ops-eqiad: ripe-atlas should be renamed 
atlas-eqiad - https://phabricator.wikimedia.org/T145145#2637227 (10Cmjohnson) 05Open>03Resolved done [16:00:52] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:00:52] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:00:52] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:01:07] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:01:08] urandom: looks good even on aqs [16:01:23] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:01:47] !log starting mysql on db1082 [16:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:02] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:02:16] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:02:24] 06Operations, 07HHVM: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2637243 (10AlexMonk-WMF) >>! In T144578#2637063, @MoritzMuehlenhoff wrote: > I've added a new deployment server mira02 I thought we had decided that when mira.deployment-prep.eqiad.wmflabs was rec... 
[16:02:26] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2637244 (10Cmjohnson) [16:03:03] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:03:32] RECOVERY - puppet last run on labnodepool1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:03:37] elukey: heh [16:03:43] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:03:44] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:03:44] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:03:52] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:04:03] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:04:13] (03CR) 10Ema: [C: 032] r::c::instances: split runtime_params in frontend and backend [puppet] - 10https://gerrit.wikimedia.org/r/310569 (owner: 10Ema) [16:04:19] (03PS2) 10Ema: r::c::instances: split runtime_params in frontend and backend [puppet] - 10https://gerrit.wikimedia.org/r/310569 [16:04:22] (03CR) 10Ema: [V: 032] r::c::instances: split runtime_params in frontend and backend [puppet] - 10https://gerrit.wikimedia.org/r/310569 (owner: 10Ema) [16:04:51] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:05:28] !log restarting cassandra on aqs100[23] T130861 [16:05:29] T130861: Investigate and implement possible simplification of Cassandra Logstash filtering - https://phabricator.wikimedia.org/T130861 [16:05:32] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, 
last run 2 minutes ago with 0 failures [16:05:33] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:36] (03Abandoned) 10Giuseppe Lavagetto: puppet: use srv records to find the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310273 (owner: 10Giuseppe Lavagetto) [16:06:08] (03PS1) 10Alexandros Kosiaris: Update palladium's motd [puppet] - 10https://gerrit.wikimedia.org/r/310577 [16:06:29] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Update palladium's motd [puppet] - 10https://gerrit.wikimedia.org/r/310577 (owner: 10Alexandros Kosiaris) [16:06:34] (03PS2) 10Alexandros Kosiaris: Update palladium's motd [puppet] - 10https://gerrit.wikimedia.org/r/310577 [16:06:37] (03CR) 10Alexandros Kosiaris: [V: 032] Update palladium's motd [puppet] - 10https://gerrit.wikimedia.org/r/310577 (owner: 10Alexandros Kosiaris) [16:07:03] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:07:33] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:07:52] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:09:12] RECOVERY - puppet last run on analytics1046 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:10:32] RECOVERY - puppet last run on analytics1049 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:12:14] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:12:24] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:13:14] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [16:13:55] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:14] gehel: FYI we merged the logging change for Cassandra, you'd need to restart your cluster to see changes applied [16:14:24] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:14:42] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:14:44] elukey: Kool! I'll schedule that... [16:14:57] urandom: all good on aqs afaics, anything else to do together? [16:15:04] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:15:15] elukey: I should upgrade cassandra at the same time ... [16:15:23] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:16:51] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:18:53] (03PS3) 10Alexandros Kosiaris: puppetmaster: post-merge command is actually irrelevant [puppet] - 10https://gerrit.wikimedia.org/r/310301 [16:18:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: post-merge command is actually irrelevant [puppet] - 10https://gerrit.wikimedia.org/r/310301 (owner: 10Alexandros Kosiaris) [16:22:32] (03Abandoned) 10Alexandros Kosiaris: Revert "Revert "rhodium: add IPv6 AAAA and reverse"" [dns] - 10https://gerrit.wikimedia.org/r/302881 (owner: 10Alexandros Kosiaris) [16:24:11] (03Abandoned) 10Alexandros Kosiaris: puppetmaster: /var/lib/puppet/ssl should be group puppet [puppet] - 10https://gerrit.wikimedia.org/r/248302 (owner: 10Alexandros Kosiaris) [16:24:32] 06Operations, 05Prometheus-metrics-monitoring: port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2637292 (10fgiunchedi) [16:24:33] (03Abandoned) 10Alexandros Kosiaris: Remove default parameter values from puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/214624 (owner: 10Alexandros Kosiaris) [16:24:47] 06Operations, 10Wikimedia-Apache-configuration: Apache mod_status metrics only available in ganglia - https://phabricator.wikimedia.org/T141424#2637306 (10elukey) Since we are super fancy let's go directly to Prometheus! 
[16:25:08] (03PS3) 10Alexandros Kosiaris: Move puppetmaster::generators to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/214626 [16:25:57] (03PS3) 10Alexandros Kosiaris: puppetmaster::generators: honor the ensure attribute [puppet] - 10https://gerrit.wikimedia.org/r/214627 [16:26:02] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster::generators: honor the ensure attribute [puppet] - 10https://gerrit.wikimedia.org/r/214627 (owner: 10Alexandros Kosiaris) [16:26:15] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Move puppetmaster::generators to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/214626 (owner: 10Alexandros Kosiaris) [16:26:41] (03PS4) 10Filippo Giunchedi: prometheus: add varnish_exporter [puppet] - 10https://gerrit.wikimedia.org/r/310557 (https://phabricator.wikimedia.org/T145659) [16:27:44] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add varnish_exporter [puppet] - 10https://gerrit.wikimedia.org/r/310557 (https://phabricator.wikimedia.org/T145659) (owner: 10Filippo Giunchedi) [16:29:18] mmhhh 16:27:43 ./manifests/site.pp:2336 WARNING puppet:// URL without modules/ found (puppet_url_without_modules) [16:29:37] which the change above doesn't touch, perhaps an earlier one [16:30:00] <_joe_> yes [16:30:01] (03Abandoned) 10Alexandros Kosiaris: Rename role::puppet::self to role::puppetmaster::self [puppet] - 10https://gerrit.wikimedia.org/r/214642 (owner: 10Alexandros Kosiaris) [16:30:03] <_joe_> akosiaris's [16:30:42] <_joe_> when he added the MOTD [16:30:44] <_joe_> I guess [16:31:15] (03PS3) 10Alexandros Kosiaris: Remove defaults from puppetmaster::passenger [puppet] - 10https://gerrit.wikimedia.org/r/214628 [16:32:02] (03CR) 10jenkins-bot: [V: 04-1] Remove defaults from puppetmaster::passenger [puppet] - 10https://gerrit.wikimedia.org/r/214628 (owner: 10Alexandros Kosiaris) [16:32:15] elukey: nope, i think we're good! [16:32:21] super :) [16:32:23] godog: .. sigh.. 
I guess I 'll fix that [16:32:39] I was hoping it wouldn't matter too much, I erred.. [16:32:54] !log stopping mysql and shutting down db1082 [16:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:07] heheh puppet-lint is inflexible with its judgments [16:34:11] PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100% [16:42:24] (03PS4) 10Alexandros Kosiaris: Remove defaults from puppetmaster::passenger [puppet] - 10https://gerrit.wikimedia.org/r/214628 [16:42:42] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:42:50] (03PS5) 10Alexandros Kosiaris: Remove defaults from puppetmaster::passenger [puppet] - 10https://gerrit.wikimedia.org/r/214628 [16:43:03] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Remove defaults from puppetmaster::passenger [puppet] - 10https://gerrit.wikimedia.org/r/214628 (owner: 10Alexandros Kosiaris) [16:44:38] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2637413 (10Cmjohnson) 05Open>03Resolved all updated. [16:44:40] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2637415 (10Cmjohnson) [16:48:38] (03Restored) 10Alexandros Kosiaris: puppetmaster: /var/lib/puppet/ssl should be group puppet [puppet] - 10https://gerrit.wikimedia.org/r/248302 (owner: 10Alexandros Kosiaris) [16:54:16] (03PS2) 10Dzahn: base: install 'freeipmi', 'libipc-run-perl' on jessie [puppet] - 10https://gerrit.wikimedia.org/r/310369 (https://phabricator.wikimedia.org/T125205) [16:55:24] (03CR) 10jenkins-bot: [V: 04-1] base: install 'freeipmi', 'libipc-run-perl' on jessie [puppet] - 10https://gerrit.wikimedia.org/r/310369 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [16:56:09] awww. 
that's still the URL without modules thing [16:56:24] (03CR) 10Muehlenhoff: "Looks good to me, but I'd say let's first get 310529 and 310530 merged before applying this?" [puppet] - 10https://gerrit.wikimedia.org/r/310564 (https://phabricator.wikimedia.org/T145378) (owner: 10Jcrespo) [16:56:41] 06Operations, 10Wikimedia-Apache-configuration: Apache mod_status metrics only available in ganglia - https://phabricator.wikimedia.org/T141424#2637489 (10fgiunchedi) @elukey yes! See also {T145659} and related [16:58:03] (03PS7) 10Alexandros Kosiaris: puppetmaster: Change the backend forced ssh command [puppet] - 10https://gerrit.wikimedia.org/r/310304 [16:58:05] (03PS6) 10Alexandros Kosiaris: puppetmaster: Make puppet-merge a template [puppet] - 10https://gerrit.wikimedia.org/r/310302 [16:58:07] (03PS6) 10Alexandros Kosiaris: puppetmaster: Delete the post-merge hooks [puppet] - 10https://gerrit.wikimedia.org/r/310303 [16:58:07] mutante: yeah, it's my fault, fixing [16:58:09] (03PS2) 10Alexandros Kosiaris: puppetmaster: Add --quiet option to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/310515 [16:58:21] I did override jenkins, did not expect it would trigger on every run [16:58:33] akosiaris: thanks, i was about to add "# lint:ignore:puppet_url_without_modules" around it [16:58:42] then you dont have to really move it [16:59:02] er... 
that was my fix anyway [16:59:07] ;-) [16:59:09] :) [16:59:35] 06Operations, 10Traffic: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2637510 (10ema) [16:59:38] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: Change the backend forced ssh command [puppet] - 10https://gerrit.wikimedia.org/r/310304 (owner: 10Alexandros Kosiaris) [16:59:41] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: Make puppet-merge a template [puppet] - 10https://gerrit.wikimedia.org/r/310302 (owner: 10Alexandros Kosiaris) [16:59:58] (03PS1) 10Volans: Salt: use puppetmaster CNAME [puppet] - 10https://gerrit.wikimedia.org/r/310587 (https://phabricator.wikimedia.org/T143536) [17:00:04] 06Operations, 10Traffic: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2637525 (10ema) p:05Triage>03High [17:00:05] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: Delete the post-merge hooks [puppet] - 10https://gerrit.wikimedia.org/r/310303 (owner: 10Alexandros Kosiaris) [17:00:34] (03PS2) 10Volans: Salt: use puppetmaster CNAME [puppet] - 10https://gerrit.wikimedia.org/r/310587 (https://phabricator.wikimedia.org/T143536) [17:00:49] (03PS8) 10Alexandros Kosiaris: puppetmaster: Change the backend forced ssh command [puppet] - 10https://gerrit.wikimedia.org/r/310304 [17:00:51] (03PS7) 10Alexandros Kosiaris: puppetmaster: Make puppet-merge a template [puppet] - 10https://gerrit.wikimedia.org/r/310302 [17:00:53] (03PS7) 10Alexandros Kosiaris: puppetmaster: Delete the post-merge hooks [puppet] - 10https://gerrit.wikimedia.org/r/310303 [17:00:55] (03PS3) 10Alexandros Kosiaris: puppetmaster: Add --quiet option to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/310515 [17:00:57] (03PS1) 10Alexandros Kosiaris: palladium: Add lint:ignore rules for it [puppet] - 10https://gerrit.wikimedia.org/r/310588 [17:01:33] (03CR) 10jenkins-bot: [V: 04-1] Salt: use puppetmaster CNAME [puppet] - 
10https://gerrit.wikimedia.org/r/310587 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [17:03:35] akosiaris: puppetlint CI job broken it's already known? [17:04:27] volans: yes, the "Add lint:ignore" change will fix it [17:05:01] volans: actually, just waiting for jenkins to +1 https://gerrit.wikimedia.org/r/310588 [17:05:19] (03CR) 10Dzahn: [C: 031] palladium: Add lint:ignore rules for it [puppet] - 10https://gerrit.wikimedia.org/r/310588 (owner: 10Alexandros Kosiaris) [17:05:23] yep just passed [17:05:24] great! [17:05:36] (03CR) 10Alexandros Kosiaris: [C: 032] palladium: Add lint:ignore rules for it [puppet] - 10https://gerrit.wikimedia.org/r/310588 (owner: 10Alexandros Kosiaris) [17:05:48] merged [17:06:03] (03PS3) 10Volans: Salt: use puppetmaster CNAME [puppet] - 10https://gerrit.wikimedia.org/r/310587 (https://phabricator.wikimedia.org/T143536) [17:06:07] rebasing [17:06:18] (03PS1) 10Ema: varnish: apply nuke_limit and lru_interval bumps to upload only [puppet] - 10https://gerrit.wikimedia.org/r/310589 (https://phabricator.wikimedia.org/T145661) [17:07:32] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:37] (03CR) 10Volans: [C: 032] Salt: use puppetmaster CNAME [puppet] - 10https://gerrit.wikimedia.org/r/310587 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [17:08:13] akosiaris: I'm running puppet-merge, I've your change too [17:08:22] can I go ahead? [17:09:53] akosiaris: merged, looked safe enough ;) [17:11:30] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/310369 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [17:11:37] cool [17:12:05] (03CR) 10Phuedx: [C: 04-2] "Blocked on T145379." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/310483 (https://phabricator.wikimedia.org/T136746) (owner: 10Jhobs) [17:19:11] !log reimage mw2198 as it failed before [17:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:26] (03CR) 10Dzahn: [C: 032] base: install 'freeipmi', 'libipc-run-perl' on jessie [puppet] - 10https://gerrit.wikimedia.org/r/310369 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [17:19:31] (03PS3) 10Dzahn: base: install 'freeipmi', 'libipc-run-perl' on jessie [puppet] - 10https://gerrit.wikimedia.org/r/310369 (https://phabricator.wikimedia.org/T125205) [17:20:17] !log change-prop deploying 19e2d51 [17:20:19] that will install 2 packages on all servers. it _might_ cause some of those false positives where it says DPKG broken, but actually wants to say "i was in the middle of installing while you checked" [17:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:20:25] 06Operations, 10DBA, 10MediaWiki-Page-deletion, 07Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2637627 (10aaron) There is a DeleteBatch maintenance script that could take a page via stdin or a list of... 
[17:21:23] (03PS2) 10Ema: varnish: apply nuke_limit and lru_interval bumps to upload only [puppet] - 10https://gerrit.wikimedia.org/r/310589 (https://phabricator.wikimedia.org/T145661) [17:26:23] (03PS3) 10Ema: varnish: apply nuke_limit and lru_interval bumps to upload only [puppet] - 10https://gerrit.wikimedia.org/r/310589 (https://phabricator.wikimedia.org/T145661) [17:26:32] (03CR) 10Ema: [C: 032 V: 032] varnish: apply nuke_limit and lru_interval bumps to upload only [puppet] - 10https://gerrit.wikimedia.org/r/310589 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [17:27:00] 06Operations, 06Operations-Software-Development: Puppet deprecation warning - https://phabricator.wikimedia.org/T145544#2637634 (10Volans) [17:27:44] (03PS2) 10Dzahn: monitoring: add check_ipmi_sensor plugin [puppet] - 10https://gerrit.wikimedia.org/r/310379 (https://phabricator.wikimedia.org/T125205) [17:28:46] 06Operations: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895#2637638 (10akosiaris) 05Open>03stalled p:05High>03Low I am thinking this is no longer happening. I haven't yet witnessed it on strontium. Probably the gerrit migration to a newer version fixed some o... [17:29:00] (03CR) 10Dzahn: [C: 032] monitoring: add check_ipmi_sensor plugin [puppet] - 10https://gerrit.wikimedia.org/r/310379 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [17:33:37] RECOVERY - puppet last run on rdb2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:33:54] 06Operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 4 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#2637645 (10Dzahn) The needed packages and the plugin script should now get installed on all jessie hosts. Waiting for that, then we can add the actual NRPE command... 
[17:33:56] 06Operations, 06Operations-Software-Development: Puppet deprecation warning - https://phabricator.wikimedia.org/T145544#2637646 (10akosiaris) Running `puppet agent -t -v` prints that too. It's expected currently, but we are in an always ongoing migration to have everything under modules/ (including templates/) [17:34:09] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet runs fails randomly on deployment-prep / beta cluster hosts - https://phabricator.wikimedia.org/T145631#2636458 (10AlexMonk-WMF) I'm guessing most of these would be T131946 (my 9th April 18:49 UTC comment through to my 12th September 05:48 UTC comment, see also P... [17:34:49] 07Puppet, 07Technical-Debt: "Setting templatedir is deprecated" warning issued on self-hosted puppetmaster - https://phabricator.wikimedia.org/T95158#2637664 (10akosiaris) [17:34:51] 06Operations, 06Operations-Software-Development: Puppet deprecation warning - https://phabricator.wikimedia.org/T145544#2637666 (10akosiaris) [17:40:47] (03PS3) 10Dzahn: Revert "archiva: migration class to rsync data to new host" [puppet] - 10https://gerrit.wikimedia.org/r/307900 (https://phabricator.wikimedia.org/T123725) [17:43:03] (03CR) 10Dzahn: [C: 032] "archiva has been switched over to the new server" [puppet] - 10https://gerrit.wikimedia.org/r/307900 (https://phabricator.wikimedia.org/T123725) (owner: 10Dzahn) [17:43:28] (03PS4) 10Dzahn: Revert "archiva: migration class to rsync data to new host" [puppet] - 10https://gerrit.wikimedia.org/r/307900 (https://phabricator.wikimedia.org/T123725) [17:50:44] !log meitnerium - stop rsyncd, remove config fragments [17:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:52:32] 30 messages for me? 
that's gotta be a mistkae [17:53:31] !log meitnerium - oops, an unrelated rsyncd is supposed to be running on this, puppet re-created files [17:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:54:27] ah, it's just puppet failure messages [17:54:30] nvm then [17:56:50] 06Operations, 10DBA, 10MediaWiki-Page-deletion, 07Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2637707 (10aaron) 05Open>03Resolved a:03aaron I deleted both now. [17:57:10] !log Deleted big pages per https://meta.wikimedia.org/w/index.php?title=Steward_requests/Miscellaneous&oldid=15908701#Deleting_a_pages_with_a_.3E5000_revisions_in_ruwiki [17:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:52] (03PS1) 10Dzahn: archiva/site: remove titanium, fix role syntax [puppet] - 10https://gerrit.wikimedia.org/r/310596 (https://phabricator.wikimedia.org/T123725) [18:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160914T1800). [18:00:04] kaldari, jhobs, and yurik: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:28] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet runs fails randomly on deployment-prep / beta cluster hosts - https://phabricator.wikimedia.org/T145631#2637713 (10Krinkle) [18:00:35] i'm around [18:00:54] I can SWAT today [18:01:25] kaldari: yurik ping for SWAT [18:02:42] jhobs: where did you need this backported? Just for wmf.18? Looks like it should have made the cutoff for the newest branch. 
[18:02:48] 06Operations, 10Analytics-Cluster: decom titanium - https://phabricator.wikimedia.org/T145666#2637714 (10Dzahn) [18:02:57] thcipriani: yeah just wmf.18 [18:03:09] ack, doing [18:03:14] here [18:03:16] we're getting to the SWAT later than we should have due to a mixup in priority [18:03:19] thcipriani: ^ [18:03:40] kaldari: hello :) [18:04:06] (03PS2) 10Dzahn: archiva/site: remove titanium, fix role syntax [puppet] - 10https://gerrit.wikimedia.org/r/310596 (https://phabricator.wikimedia.org/T123725) [18:04:11] (03CR) 10Dzahn: [C: 032] archiva/site: remove titanium, fix role syntax [puppet] - 10https://gerrit.wikimedia.org/r/310596 (https://phabricator.wikimedia.org/T123725) (owner: 10Dzahn) [18:04:22] (03PS1) 10Dzahn: install_server: remove host titanium [puppet] - 10https://gerrit.wikimedia.org/r/310597 (https://phabricator.wikimedia.org/T145666) [18:04:38] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ferm] [18:04:40] jhobs: so through the gerrit interface, I'm getting, "Cherry pick failed: merge conflict" Looks like it needs a little manual work. Could you do that? 
[18:05:13] 06Operations, 10Analytics-Cluster, 13Patch-For-Review: decom titanium - https://phabricator.wikimedia.org/T145666#2637734 (10Dzahn) [18:05:49] thcipriani: I'm not sure how to cherry pick it to a previous branch; I've never had to do that before [18:06:41] to a different release branch* [18:07:31] (03CR) 10Dzahn: [C: 032] install_server: remove host titanium [puppet] - 10https://gerrit.wikimedia.org/r/310597 (https://phabricator.wikimedia.org/T145666) (owner: 10Dzahn) [18:07:38] (03PS2) 10Dzahn: install_server: remove host titanium [puppet] - 10https://gerrit.wikimedia.org/r/310597 (https://phabricator.wikimedia.org/T145666) [18:07:56] jhobs: so you just will need to fetch down the wmf.18 branch of ZeroBanner check that out and use git cherry-pick to get 65898f7a21ea1b0aba8bcd1752ed67b5cf9847ec added to that branch [18:08:08] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 739 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4828863 keys - replication_delay is 739 [18:08:09] PROBLEM - Redis status tcp_6480 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 645 600 - REDIS on 10.192.48.44:6480 has 1 databases (db0) with 4884170 keys - replication_delay is 645 [18:08:19] (03PS1) 10Volans: Salt: avoid Puppet warning [puppet] - 10https://gerrit.wikimedia.org/r/310598 (https://phabricator.wikimedia.org/T145544) [18:08:37] jhobs: it looks like that commit has some conflicts with some code on wmf.18, so I can't do it via the gerrit interface. You will have some conflicts to resolve. [18:08:39] thcipriani: right, what's the remote for the wmf.18 branch, "git fetch origin wmf/wmf.18"? 
[18:08:46] thcipriani, awesome, let me know when its out [18:09:42] jhobs: the branch is wmf/1.28.0-wmf.18 [18:10:03] thcipriani: ok, I'll get on it and get back to you [18:10:11] jhobs: thanks [18:10:27] (03PS2) 10Volans: Salt: avoid Puppet warning [puppet] - 10https://gerrit.wikimedia.org/r/310598 (https://phabricator.wikimedia.org/T145544) [18:10:34] (03PS1) 10Dzahn: remove titanium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/310599 (https://phabricator.wikimedia.org/T145666) [18:12:23] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2637767 (10Dzahn) [18:12:26] 06Operations, 10Analytics-Cluster, 13Patch-For-Review: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2637765 (10Dzahn) 05Open>03Resolved titanium has been replaced by meitnerium. This is done. The remaining decom steps (up to physically removing it... [18:12:40] thcipriani: take yer time [18:12:52] 06Operations, 10Analytics-Cluster: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2637771 (10Dzahn) [18:14:18] 06Operations, 10Analytics-Cluster, 13Patch-For-Review: decom titanium - https://phabricator.wikimedia.org/T145666#2637783 (10Dzahn) a:05Dzahn>03None removing from pupet, preparing the decom, will wait another couple days or so before physical shutdown and removal from DNS [18:14:28] 06Operations, 10Analytics-Cluster, 13Patch-For-Review: decom titanium - https://phabricator.wikimedia.org/T145666#2637785 (10Dzahn) a:03Dzahn [18:15:29] (03CR) 10Volans: [C: 032] Salt: avoid Puppet warning [puppet] - 10https://gerrit.wikimedia.org/r/310598 (https://phabricator.wikimedia.org/T145544) (owner: 10Volans) [18:15:52] #mediawiki_security [18:15:52] kaldari: https://gerrit.wikimedia.org/r/#/c/310478/ should be live on mw1099, check please [18:15:55] oops [18:16:11] thcipriani: should be good to go here: 
https://gerrit.wikimedia.org/r/#/c/310601 [18:16:31] thcipriani: checking... [18:16:35] once jenkins does its thing, that is [18:16:36] jhobs: awesome checking [18:17:07] !log T133805: Renabling Pupppet, forcing run, and restarting Cassandra to restore 8M region size on restbase1013-a.eqiad.wmnet [18:17:08] T133805: Isolated testing of GC settings for aggressive Cassandra chunk_length_kb values - https://phabricator.wikimedia.org/T133805 [18:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:17:56] PROBLEM - Redis status tcp_6380 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 618 600 - REDIS on 10.192.0.120:6380 has 1 databases (db0) with 4878149 keys - replication_delay is 618 [18:18:50] PROBLEM - Redis status tcp_6481 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 702 600 - REDIS on 10.192.48.44:6481 has 1 databases (db0) with 4875250 keys - replication_delay is 702 [18:19:27] thcipriani: looks good [18:19:35] thcipriani: feel free to sync [18:19:41] kaldari: ack, thanks, pushing everywhere [18:19:49] PROBLEM - Redis status tcp_6379 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 690 600 - REDIS on 10.192.0.120:6379 has 1 databases (db0) with 9587269 keys - replication_delay is 690 [18:20:21] thcipriani: everytime you say "ack" I think something has gone horribly wrong :) [18:20:26] ^ [18:20:27] :D [18:20:42] acknowledge, sorry, will switch back to just "k" :) [18:20:49] PROBLEM - Redis status tcp_6381 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 700 600 - REDIS on 10.192.0.120:6381 has 1 databases (db0) with 4880655 keys - replication_delay is 700 [18:20:53] or 10-4 if you want to be fancy [18:20:54] aaaaah [18:20:58] now it makes sense [18:21:00] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:23:07] jhobs: is that an "aaaaah" sigh of relief or an "aaaaah" scream for help? 
[18:23:13] 10-4, I do want to be fancy [18:23:20] !log thcipriani@tin Synchronized php-1.28.0-wmf.19/includes/pager/ReverseChronologicalPager.php: SWAT: [[gerrit:310478|Partially reverting I8e684f06 to restore some legacy behavior (T145597)]] (duration: 00m 48s) [18:23:21] T145597: Contributions offset broken on test wikipedia - https://phabricator.wikimedia.org/T145597 [18:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:26] ^ kaldari sync'd everywhere [18:23:30] kaldari: a sigh of screaming acknowledgment ;) [18:23:33] checking .... [18:24:08] thcipriani: yay, bug fixed, disaster averted! just another day at the office [18:24:19] kaldari: glad to hear :) [18:25:30] yurik: kartographer js change live on mw1099, check please [18:25:40] * yurik checks [18:27:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [18:28:04] RECOVERY - Redis status tcp_6380 on rdb2002 is OK: OK: REDIS on 10.192.0.120:6380 has 1 databases (db0) with 4814974 keys - replication_delay is 4 [18:28:29] RECOVERY - Redis status tcp_6480 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6480 has 1 databases (db0) with 4820008 keys - replication_delay is 0 [18:28:31] RECOVERY - Redis status tcp_6381 on rdb2002 is OK: OK: REDIS on 10.192.0.120:6381 has 1 databases (db0) with 4817052 keys - replication_delay is 0 [18:28:58] RECOVERY - Redis status tcp_6481 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6481 has 1 databases (db0) with 4811281 keys - replication_delay is 0 [18:29:02] ^ Hrm, lots of redis-related things in fatalmonitor should I pause SWAT? Is there an investigation happening? 
[18:29:40] (everything yet to be deployed is seemingly very unrelated to anything in redis) [18:30:01] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [18:30:01] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:30:01] RECOVERY - Redis status tcp_6379 on rdb2002 is OK: OK: REDIS on 10.192.0.120:6379 has 1 databases (db0) with 9523630 keys - replication_delay is 0 [18:30:49] seems like most errors have subsided along with these recoveries [18:33:02] thcipriani, are you sure its on 1099? [18:35:21] yurik: yup, grep for 'window' on that file on mw1099 seems to return what's expected. [18:37:25] thcipriani, not sure - I'm trying it with https://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F:%D0%A4%D0%BE%D1%80%D1%83%D0%BC/%D0%9D%D0%BE%D0%B2%D0%BE%D1%81%D1%82%D0%B8?useskin=monobook&debug=1 --- and it looks like the _fixMapSize function still uses document [18:37:51] thcipriani, better link: https://ru.wikipedia.org/wiki/Википедия:Форум/Новости?useskin=monobook&debug=1#/maplink/1 [18:37:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4814635 keys - replication_delay is 0 [18:38:41] yurik: oh! wmf.19 hasn't hit wikipedias yet, still in group0: https://tools.wmflabs.org/versions/ [18:39:02] thcipriani, silly me, we should patch 18 too :( [18:39:19] should i make the cherrypick? [18:39:20] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2454047 (10Dzahn) current access log is only about 9% https and a chunk of that is all Catchpoint monitoring. About 32% are from python-requests UA... 
[18:39:25] yurik: sure, please do [18:39:44] thcipriani, https://gerrit.wikimedia.org/r/#/c/310608/ [18:40:48] thcipriani, i will test it with mediawiki in the meantime [18:41:01] yurik: kk, thanks :) [18:42:08] thcipriani, yep, works good [18:42:22] go ahead and sync both to all [18:42:23] yurik: kk, going live everywhere on wmf.19 [18:42:42] thcipriani, you might as well do 18 to all too :) [18:42:49] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2637840 (10Dzahn) >>! In T140128#2625078, @AlexMonk-WMF wrote: > Can you filter those access logs down to labs entries only (208.80.155.128 - 208.80... [18:43:07] yurik: whenever that one gets merged, I'll push that out, too [18:43:12] still waiting on zuul/jenkins [18:43:38] three things humans can do indefinitly: watch fire, water, and zuul merging [18:43:59] :D [18:45:15] !log thcipriani@tin Synchronized php-1.28.0-wmf.19/extensions/Kartographer/modules/box/Map.js: SWAT: [[gerrit:310590|Map should take viewport width/height instead of body width/height (T145521)]] (duration: 00m 47s) [18:45:17] T145521: does not work in Monobook skin - https://phabricator.wikimedia.org/T145521 [18:45:18] yurik: ^ live on wmf.19 [18:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:45:24] kk [18:46:07] waiting on jenkins is the new "compiling" [18:46:21] jhobs: your wmf.18 zerobanner change is live on mw1099, check please [18:46:27] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:47:09] * jhobs checking... [18:47:28] it's either "waiting on jenkins" or "vagrant up" is the new "compiling" [18:47:44] vagrant git-update if you really wanna be ballsy ;) [18:47:49] thcipriani: LGTM! 
[18:48:09] jhobs: kk, syncing live [18:49:04] o/ [18:49:42] hashar: hello [18:49:53] just finishing up "morning" swat [18:50:13] {POV|Merican} [18:50:24] !log thcipriani@tin Synchronized php-1.28.0-wmf.18/extensions/ZeroBanner/modules: SWAT: [[gerrit:310601|Display edit icon and page actions]] (duration: 00m 47s) [18:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:35] ^ jhobs change is live everywhere [18:51:18] thcipriani: hmm... not seeing on enwiki :/ [18:51:24] I don't know I would still call it morning even for the west-coast of America [18:51:54] 9 minutes of morning left here [18:52:27] jhobs: hmm, javascript and css changes can take a few to appear https://wikitech.wikimedia.org/wiki/How_to_deploy_code#A_note_on_JavaScript_and_CSS [18:52:43] thcipriani: yeah... I thought I cache-busted, but I'll give it a bit [18:52:48] thcipriani, that link still shows document. not window. [18:52:52] 1099 showed it instantly though [18:53:15] !log RESTBase deploy d39580f14 canary on restbase1007 [18:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:53:25] i checked - mw1099 uses wmf18 [18:53:34] but no patch [18:53:40] queries to mw1099 are uncached / bypass RL cache iirc [18:53:54] yurik: for wmf.18? Haven't sync'd the patch for kartographer there yet [18:54:05] ah, i thought you already synced that one [18:54:06] sorry [18:54:10] thcipriani: take your time for the swat, can delay group1 as needed [18:54:24] thcipriani: we're all good now, thanks for the help! 
[18:54:48] thcipriani, on mw.org it works fine [18:54:57] jhobs: glad to hear it, thanks for the cherry-picks :) [18:56:05] yurik: wmf.18 is live on mw1099, if there's anything you'd like to check there [18:56:40] thcipriani, yep, works great [18:57:15] syncing everywhere now [18:57:29] speaking of css I got some fonts on https://wikitech.wikimedia.org/wiki/Swift/Swift_Container_Name_Conventions [18:57:45] the skin seems to use monospace for everything, but maybe it is just me [18:57:56] !log thcipriani@tin Synchronized php-1.28.0-wmf.18/extensions/Kartographer/modules/box/Map.js: SWAT: [[gerrit:310608|Map should take viewport width/height instead of body width/height (T145521)]] (duration: 00m 47s) [18:57:57] T145521: does not work in Monobook skin - https://phabricator.wikimedia.org/T145521 [18:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:58:01] ^ yurik should be live everywhere [18:58:23] thcipriani, yep, thanks!!! [18:58:28] eh, that is kind of weird [18:58:31] Public files: / maps to the shard "". [18:58:35] from the wiki text [18:58:43] what's with the [18:58:56] hashar: looks like it's leading some tags https://wikitech.wikimedia.org/w/index.php?title=Swift/Swift_Container_Name_Conventions&action=edit&section=1 [18:59:01] leaking* [18:59:03] !log RESTBase deploy d39580f14 [18:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:59:19] jhobs: yeah and that is a bug really [18:59:38] maybe so, but it's also incorrect wikitext [19:00:04] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160914T1900).
[19:00:38] hashar: swat is complete, run the train when ready [19:02:06] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:02:28] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:05:04] filled it https://phabricator.wikimedia.org/T145671 [19:05:09] will maybe look at it later [19:05:56] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:06:20] (03Merged) 10jenkins-bot: group1 wikis to 1.28.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310612 (owner: 10Hashar) [19:06:50] Pchelolo: looks like restbase is not happy [19:07:06] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:07:06] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:07:11] hashar: looking [19:07:15] !log titanium - puppet node clean [19:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:07:23] Pchelolo: just revert? 
[19:07:36] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:07:50] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:08:03] hashar: nope, it should get back to normal [19:08:38] hashar: oh, nope, reveting [19:08:47] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:09:05] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:09:35] !log revert RESTBase to d10d759 [19:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:05] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:10:06] * hashar waits for stuff to settle [19:10:58] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [19:11:00] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve 
aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:11:38] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:11:47] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [19:11:54] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:12:11] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [19:12:11] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [19:12:27] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:12:45] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [19:12:49] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:13:03] jdlrobson: hiya, q for you about trending project, what IRC room do yall hang out in? I'll come ask there [19:13:16] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:13:41] (03PS1) 10Dzahn: replace deprecated puppetstoredconfigclean.rb [puppet] - 10https://gerrit.wikimedia.org/r/310613 [19:13:44] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [19:14:11] lets assume RestBase is back :D [19:14:20] Pchelolo: please deploy out of the MW train window :] [19:14:40] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [19:14:41] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:14:44] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.19 [19:14:44] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [19:14:49] hashar: ok, will do next time. RB is back to normal [19:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:04] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [19:15:04] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [19:15:09] (03CR) 10jenkins-bot: [V: 04-1] replace deprecated puppetstoredconfigclean.rb [puppet] - 10https://gerrit.wikimedia.org/r/310613 (owner: 10Dzahn) [19:16:01] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [19:16:02] Pchelolo: great! [19:16:09] 19:14:49 utils/hiera_lookup:55:44: C: Use || instead of or. 
[19:16:10] 19:14:49 if scope['::fqdn'].end_with?('.wmflabs') or scope['::fqdn'].end_with?('.labtest') [19:16:13] uhhhh [19:16:24] Pchelolo: good luck finding out what ever issue happens :] For the MW train it is done and looks fine now [19:16:26] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [19:16:26] rake-jessie fail is odd [19:16:38] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [19:16:44] mutante: ah that is rubocop [19:17:06] mutante: you can repo locally with: bundle install ; bundle exec rubocop utils/hiera_lookup [19:17:09] (03PS2) 10Dzahn: replace deprecated puppetstoredconfigclean.rb [puppet] - 10https://gerrit.wikimedia.org/r/310613 [19:17:22] not sure how that one leaked though [19:17:24] does not make sense [19:17:41] hashar: ok, let's see if it was just a one-off [19:17:46] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [19:18:13] mutante: na i can reproduce locally [19:18:14] mutante: [19:18:53] (03CR) 10jenkins-bot: [V: 04-1] replace deprecated puppetstoredconfigclean.rb [puppet] - 10https://gerrit.wikimedia.org/r/310613 (owner: 10Dzahn) [19:19:01] alright, yea, i dont edit .rb files that often [19:19:06] (03PS1) 10Volans: wmf-reimage: fix puppet clean certificate and facts [puppet] - 10https://gerrit.wikimedia.org/r/310614 (https://phabricator.wikimedia.org/T145544) [19:19:16] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [19:19:32] (03PS1) 10Hashar: utils: fix rubocop issue in hiera_lookup [puppet] - 10https://gerrit.wikimedia.org/r/310615 [19:19:38] mutante: ^^ fix [19:19:40] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [19:19:40] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [19:19:43] wow, so fast :) [19:19:48] was there a deploy now ? 
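Editorial note on the rubocop message quoted above ("Use || instead of or"): a minimal Ruby sketch of why the style rule exists. The helper names `labs_host?`, `assign_with_double_pipe` and `assign_with_or` are hypothetical and are not taken from utils/hiera_lookup; only the two `end_with?` checks mirror the line quoted in the log. Inside an `if` condition both operators behave the same, but `or` binds more loosely than `=`, which is the usual reason linters prefer `||`.

```ruby
# Hypothetical helper mirroring the condition quoted in the log,
# rewritten with || as rubocop suggests.
def labs_host?(fqdn)
  fqdn.end_with?('.wmflabs') || fqdn.end_with?('.labtest')
end

# || binds tighter than =, so the whole boolean expression is assigned.
def assign_with_double_pipe(left, right)
  result = left || right
  result
end

# `or` binds looser than =, so this parses as (result = left) or right.
# Only `left` is ever assigned to result.
def assign_with_or(left, right)
  result = left or right
  result
end

puts labs_host?('host.eqiad.wmflabs')
puts assign_with_double_pipe(false, true)
puts assign_with_or(false, true)
```

`String#end_with?` also accepts several suffixes at once, so `fqdn.end_with?('.wmflabs', '.labtest')` would be an equivalent way to write the check.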
[19:19:53] the issue is rake-jessie was not triggered for a change that only changed /utils/hiera_lookup [19:19:56] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [19:20:00] the job has been made to only trigger when .rb files are altered [19:20:05] tis a bug [19:20:18] if there was RTL wikis broke RTL placement just now [19:20:31] (03CR) 10jenkins-bot: [V: 04-1] wmf-reimage: fix puppet clean certificate and facts [puppet] - 10https://gerrit.wikimedia.org/r/310614 (https://phabricator.wikimedia.org/T145544) (owner: 10Volans) [19:21:10] I see there was [19:21:23] Who can help ? hashar ? [19:22:31] (03CR) 10Dzahn: [C: 032] "19:18:49 utils/hiera_lookup:55:44: C: Use || instead of or." [puppet] - 10https://gerrit.wikimedia.org/r/310615 (owner: 10Hashar) [19:22:43] (03CR) 10Dzahn: "thanks for the super quick fix" [puppet] - 10https://gerrit.wikimedia.org/r/310615 (owner: 10Hashar) [19:22:56] matanya: what do you mean? [19:23:04] hashar: https://he.wikipedia.org/wiki/%D7%A7%D7%99%D7%95%D7%91 [19:23:15] you gotta explain [19:23:19] I dont read hebrew :] [19:23:19] templates moved to the right, instead of being on the left [19:23:23] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/310614 (https://phabricator.wikimedia.org/T145544) (owner: 10Volans) [19:23:25] AHHH [19:23:28] css / RTL issue [19:23:30] gotta revert [19:23:31] yes [19:23:51] matanya: can you take a screenshot please ? [19:23:56] (03CR) 10Dzahn: [C: 031] "the -1 from jenkins should be gone now (fixed by https://gerrit.wikimedia.org/r/#/c/310615/1)" [puppet] - 10https://gerrit.wikimedia.org/r/310614 (https://phabricator.wikimedia.org/T145544) (owner: 10Volans) [19:23:56] sure [19:23:59] thx [19:24:01] I am reverting [19:24:20] (03PS2) 10Volans: wmf-reimage: fix puppet clean certificate and facts [puppet] - 10https://gerrit.wikimedia.org/r/310614 (https://phabricator.wikimedia.org/T145544) [19:24:44] hashar: any ticket to add the screenshot to? 
[19:25:16] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: Revert group1. Hebrew wiki has templates on the wrong side / CSS is off [19:25:20] guess you can fill one :] [19:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:25:25] you are the very first to report the issue [19:25:32] I have pushed a new mw version to the wikis [19:25:37] like 5 mins ago [19:25:47] and have reverted the upgrade. Looks like the template is all fine now [19:26:25] (03PS3) 10Volans: wmf-reimage: fix puppet clean certificate and facts [puppet] - 10https://gerrit.wikimedia.org/r/310614 (https://phabricator.wikimedia.org/T145544) [19:26:35] (03Abandoned) 10Dzahn: replace deprecated puppetstoredconfigclean.rb [puppet] - 10https://gerrit.wikimedia.org/r/310613 (owner: 10Dzahn) [19:26:35] thanks hashar. i will report [19:28:13] (03PS1) 10Hashar: Revert "group1 wikis to 1.28.0-wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310617 (https://phabricator.wikimedia.org/T143328) [19:28:54] matanya: please make it with priority "Unbreak Now" and ideally with parent task T143328 (that is the mw upgrade task) [19:28:55] T143328: MW-1.28.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T143328 [19:29:11] doing hashar [19:29:45] (03CR) 10Hashar: [C: 032] "Already reverted. Did it because hewiki (and probably any RTL) had css mixing up with infobox floating on the wrong side." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/310617 (https://phabricator.wikimedia.org/T143328) (owner: 10Hashar) [19:30:24] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.28.0-wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310617 (https://phabricator.wikimedia.org/T143328) (owner: 10Hashar) [19:31:23] looks like the beta cluster (running master) float on the proper side https://he.wikipedia.beta.wmflabs.org/wiki/%D7%A1%D7%A2%D7%99%D7%93_%D7%92%27%D7%9C%D7%99%D7%9C%D7%99 :D [19:31:42] hashar: that is just an image, not a template [19:33:10] hashar: reported [19:33:17] https://phabricator.wikimedia.org/T145673 [19:33:27] nice thank you! [19:33:35] now one has to figure out what is wrong :] [19:35:22] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [19:35:30] hashar: beta doesn't have any RTL template [19:36:53] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:38:25] (03PS4) 10Volans: wmf-reimage: fix puppet clean certificate and facts [puppet] - 10https://gerrit.wikimedia.org/r/310614 (https://phabricator.wikimedia.org/T145544) [19:38:38] mutante: you ok to merge ^^^ ? [19:38:49] matanya: I am doing an export + import :D [19:38:51] I've removed the references in rubocop too [19:39:25] thanks hashar. let me know if you need any help [19:39:43] oh, interesting that the file was in an exclude list [19:40:05] matanya: https://he.wikipedia.beta.wmflabs.org/wiki/%D7%A7%D7%99%D7%95%D7%91 ? [19:40:12] does that look any good ? 
:] [19:40:21] yes, broken :) [19:40:26] GOOD [19:40:29] (03PS1) 10Yuvipanda: labs: Introduce httpyaml backend for labtest [puppet] - 10https://gerrit.wikimedia.org/r/310619 (https://phabricator.wikimedia.org/T133412) [19:40:30] we have a repro :] [19:40:43] :D [19:40:44] matanya: can you please screenshot that page and add it to the task description as well ? :] [19:40:52] sure [19:41:29] (03CR) 10Dzahn: [C: 032] wmf-reimage: fix puppet clean certificate and facts [puppet] - 10https://gerrit.wikimedia.org/r/310614 (https://phabricator.wikimedia.org/T145544) (owner: 10Volans) [19:41:31] volans: yea [19:41:48] (03CR) 10jenkins-bot: [V: 04-1] labs: Introduce httpyaml backend for labtest [puppet] - 10https://gerrit.wikimedia.org/r/310619 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [19:42:45] heh, my comment is identical to yours [19:42:46] mutante: ok, running puppet merge [19:43:13] volans: i just did [19:43:16] hashar: I added another deployment blocker, global rename seems to have broken [19:43:34] full service :D [19:44:24] (03PS2) 10Yuvipanda: labs: Introduce httpyaml backend for labtest [puppet] - 10https://gerrit.wikimedia.org/r/310619 (https://phabricator.wikimedia.org/T133412) [19:45:23] :) [19:45:28] (03CR) 10jenkins-bot: [V: 04-1] labs: Introduce httpyaml backend for labtest [puppet] - 10https://gerrit.wikimedia.org/r/310619 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [19:49:32] 06Operations, 06Developer-Relations (Jul-Sep-2016): Operations Team Offsite - https://phabricator.wikimedia.org/T141940#2517241 (10Dzahn) It's going to be in Barcelona, Spain. [19:51:05] legoktm: it got reverted hasn't it ? [19:51:27] hashar: yes. 
but as a blocker for futher rollout [19:51:40] ah [19:51:50] I got confused with the other reverted you asked (and done by https://gerrit.wikimedia.org/r/#/c/310440/ ) [19:51:53] so I thought it was ok :( [19:52:07] ok, I might need to make the rake test non voting [19:52:14] it won't let me have a Ruby class named Httpyaml_backend (needs to be camelcase) [19:52:23] but that's what the puppet api requires (_backend suffix) [19:52:25] HttpYamlBackend [19:52:30] hehe [19:52:43] you can add exceptions :] [19:52:51] if a test is voting, it should pass on all current code, and this test does not [19:52:57] well [19:53:00] if you send new code ... [19:53:04] there's plenty of other ruby code this is failing, hashar. all the other ruby backends are also doing this [19:53:22] there's Mwyaml_backend, Nuyaml_backend, etc [19:53:39] I don't really want to go touch them all, and this is a nasty surprise waiting for anyone who touches those next [19:54:06] looks like the _backend.rb are ignored in .rubocop_todo.yml [19:54:14] but most probably, that rule should be dropped [19:55:36] or add an exception [19:56:47] legoktm: l suspect the breakage is related to your changes to scribunto [19:57:02] huh??? [19:57:03] (03PS3) 10Yuvipanda: labs: Introduce httpyaml backend for labtest [puppet] - 10https://gerrit.wikimedia.org/r/310619 (https://phabricator.wikimedia.org/T133412) [19:57:16] (03PS1) 10Hashar: rubocop: ignore underscore in class of puppet backends [puppet] - 10https://gerrit.wikimedia.org/r/310621 [19:57:24] yuvipanda: https://gerrit.wikimedia.org/r/310621 would fix it [19:57:48] yuvipanda: that drop the camelcase style rule for all the hiera _backend.rb files [19:57:56] matanya: none of my Scribunto changes went out in wmf.19 [19:58:13] in fact, no change to Scribunto went out in wmf.19 [19:58:25] legoktm: really? 
i was under the impression they were [19:58:35] (03PS4) 10Yuvipanda: labs: Introduce httpyaml backend for labtest [puppet] - 10https://gerrit.wikimedia.org/r/310619 (https://phabricator.wikimedia.org/T133412) [19:58:35] thanks hashar [19:58:49] https://www.mediawiki.org/wiki/MediaWiki_1.28/wmf.19 Scribunto isn't listed, hence no changes [19:58:59] yuvipanda: maybe some rules should be dropped [19:59:01] (might've had localization updates) [19:59:38] thanks legoktm will play around, and sorry for pointing ;) [19:59:51] matanya hi, you can see https://www.mediawiki.org/wiki/MediaWiki_1.28/wmf.19 [19:59:54] so the infobox on https://he.wikipedia.beta.wmflabs.org/wiki/%D7%A7%D7%99%D7%95%D7%91 should be on the right? [19:59:57] er, on the left? [19:59:58] for all the updates done to the branch [20:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160914T2000). 
[20:00:08] yes legoktm [20:00:15] (03CR) 10jenkins-bot: [V: 04-1] labs: Introduce httpyaml backend for labtest [puppet] - 10https://gerrit.wikimedia.org/r/310619 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [20:00:15] nothing for ORES [20:01:44] (03CR) 10Yuvipanda: [C: 032] rubocop: ignore underscore in class of puppet backends [puppet] - 10https://gerrit.wikimedia.org/r/310621 (owner: 10Hashar) [20:01:52] huh, the firefox inspector really doesn't like that infobox [20:01:58] (03PS5) 10Yuvipanda: labs: Introduce httpyaml backend for labtest [puppet] - 10https://gerrit.wikimedia.org/r/310619 (https://phabricator.wikimedia.org/T133412) [20:02:20] legoktm happens for me [20:02:26] using microsoft edge [20:02:32] I know [20:02:44] ok [20:03:04] !log starting Parsoid deploy [20:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:49] Who do I ask about images at wikitech not rendering? I noticed the last 3 versions of this file aren't rendering - https://wikitech.wikimedia.org/wiki/File:Infrastructure_overview.png - http://imagizer.imageshack.com/img923/2033/IL3MmE.png - (ping elukey as uploader, who might have insights) [20:03:54] (03CR) 10Yuvipanda: [C: 032] labs: Introduce httpyaml backend for labtest [puppet] - 10https://gerrit.wikimedia.org/r/310619 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [20:04:05] yuvipanda: neat! :] [20:04:25] yuvipanda: feel free to disable rubocop rules in the .rubocop.yaml some might just be too pedantic [20:04:42] * yuvipanda nods [20:04:44] thanks hashar [20:05:49] !log update RESTBase to 5ae9a506 - staging [20:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:54] s/some/most/ [20:07:28] matanya: I know what's wrong [20:07:29] matanya what about adding float: left; to .infobox? 
[20:07:33] and oph [20:07:34] oh [20:07:42] yuvipanda: hashar: i've found this to be a reasonable preset: https://github.com/wikimedia/oojs-ui/blob/master/.rubocop.yml [20:07:47] what is that legoktm ? [20:08:41] matanya: MediaWiki:Common.css isn't being applied [20:08:54] Krinkle: ? ^ [20:09:07] what patch broke it ? [20:09:12] I'm still looking now [20:09:16] holy xx [20:09:45] urls to gerrit from grrrit-wm are rather hard to read on a dark background. [20:10:20] Krinkle file a bug for that one, not sure how we can detect it for dark backgrounds though [20:10:40] paravoid: We only use neutral colors, no need for it to be dark in the first place. [20:10:41] Krinkle: https://phabricator.wikimedia.org/T145673 I've narrowed it down to site.styles not being applied [20:10:48] legoktm: steps to reproduce? [20:10:57] https://he.wikipedia.beta.wmflabs.org/wiki/%D7%A7%D7%99%D7%95%D7%91 [20:11:14] Visit https://he.wikipedia.beta.wmflabs.org/wiki/%D7%A7%D7%99%D7%95%D7%91 observe infobox is on right [20:11:26] Open inspector, look at styles being applied to class=infobox [20:11:43] PROBLEM - restbase endpoints health on xenon is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [20:11:49] Read https://he.wikipedia.beta.wmflabs.org/wiki/%D7%9E%D7%93%D7%99%D7%94_%D7%95%D7%99%D7%A7%D7%99:Common.css which I just imported and note that the .infobox styles that contain float: left are not being applied [20:11:53] !log updated Parsoid to version aed15dda [20:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:25] PROBLEM - restbase endpoints health on cerium is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of 
range [20:13:06] !log revert RESTBase is staging to d10d759d42 [20:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:14:10] hashar: https://gerrit.wikimedia.org/r/#/c/310630/ needs to be backported, would you like to do it or should I? [20:14:12] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [20:14:29] legoktm: Can't reproduce locally on vagrant [20:14:40] I can't either [20:14:52] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body: list index out of range [20:14:54] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [20:15:05] Also not reproduced on enwiki beta [20:15:07] only hewiki? [20:15:08] Interestingly, the .infobox styles start with a /* @noflip */ comment, do you think that could be relevant? [20:15:20] legoktm: ahhhhh __METHOD__ :( [20:15:25] I believe it's just hewiki... matanya can you confirm? [20:15:43] Only hewiki beta? Not testwiki or enwiki beta? [20:15:46] any other wiki with wmf19 i can check legoktm ? [20:15:55] matanya: test.wikipedia.org [20:16:12] We could make testwiki an RTL language? 
[20:16:15] matanya: bookmark https://tools.wmflabs.org/versions/ , open, then click on the version to get a list of wikis [20:16:16] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.151, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [20:16:37] legoktm: it needs to br RTL wiki [20:17:00] thanks hashar it is handy [20:17:13] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [20:17:25] It' not related to the contents of Common.css as the whoel module isn't being loaded [20:18:30] Krinkle: If I make my local wiki $wgLanguageCode = 'he'; I can reproduce [20:18:42] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [20:18:53] My body { background-color: green; } rule fails [20:18:59] neat [20:19:09] time for a git bisect I guess ? :( [20:19:12] too green [20:19:37] mobrovac: we are waiting for a coloroid service! [20:19:47] haha [20:20:18] ok, I'll bisect [20:20:27] can't confirm nor deny on testwiki [20:20:56] I suspect it's because of the localised namespace [20:22:18] Krinkle: bisected to Preload ResourceLoaderWikiModule::getTitleInfo in OutputPage [20:23:11] !log manually raise max_connections on labtestcontrol2001, see T145679 for ticket [20:23:12] T145679: Puppetize mysql on labtest - https://phabricator.wikimedia.org/T145679 [20:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:23:17] legoktm: OK. I found the problem. [20:23:23] eh.. 
yuvi beat me to it ^ [20:23:26] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: fix chromium-browser startup script on osmium (was: fix puppet run on osmium ) - https://phabricator.wikimedia.org/T141023#2638194 (10Dzahn) The remaining problem was: Jul 28 01:36:01 osmium chromium[34385]: [0728/013601:ERROR:nss_util.cc(99... [20:24:23] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2638197 (10Dzahn) [20:24:26] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: fix chromium-browser startup script on osmium (was: fix puppet run on osmium ) - https://phabricator.wikimedia.org/T141023#2638196 (10Dzahn) 05Open>03Resolved [20:25:27] 06Operations, 10VisualEditor, 07Performance: fix chromium-browser startup script on osmium (was: fix puppet run on osmium ) - https://phabricator.wikimedia.org/T141023#2484480 (10Dzahn) [20:25:48] Krinkle: ok, I assume you're writing a patch? [20:25:51] legoktm Krinkle https://gerrit.wikimedia.org/r/#/c/309258/ [20:25:58] hashar ^^ [20:26:10] paladox: we know [20:26:25] paladox: they can find a link in git :) [20:26:26] Oh, i thought you didnt know of that patch. Sorry about that [20:27:10] legoktm: Yeah, we use title::getPrefixedText() for the info keys to intersect. But the input is static (always "MediaWiki:Common.css"), they're not normalised to hte same before intersecting [20:27:11] Patch incoming [20:27:31] legoktm: The actual bug is not in this commit, but in the previous one. This one just makes it more obvious. [20:27:36] It probably affects other loading as well. [20:27:41] * legoktm nods [20:28:31] where else should i have seen it apart from infoboxes ? anything affected by special setting in commons.css ? 
[20:28:39] !log hashar@tin Synchronized php-1.28.0-wmf.19/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameJob.php: Fix LocalRenameJob transaction owner to match JobRunner T143328 T145596 (duration: 00m 48s) [20:28:41] T143328: MW-1.28.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T143328 [20:28:41] T145596: Renames getting stuck on mediawiki.org (Sept 13, 2016) - https://phabricator.wikimedia.org/T145596 [20:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:29:05] legoktm: I have deployed your rename user hot fix [20:29:13] hashar: thanks [20:30:44] legoktm: Krinkle if you can get test written to cover the issue, that would be quite nice. In case it happens again [20:31:32] 06Operations, 06Discovery, 06Labs, 06Maps, and 3 others: PostgreSQL query planner bug on labsdb1006 - https://phabricator.wikimedia.org/T145599#2638264 (10Yurik) p:05Triage>03Low [20:43:58] Krinkle: legoktm: I guess we want to postpone the group1 to tomorrow dont we ? [20:44:12] I dont mind deploying it again tomorrow then process with group2 an hour or so after [20:45:51] I defer to Krinkle [20:46:03] RECOVERY - Apache HTTP on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.071 second response time [20:46:34] thcipriani: what do we do usually when train has some issue ? :D [20:46:46] I have rollbacked group1 due to a couple issues [20:46:52] RECOVERY - salt-minion processes on mw2198 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:47:00] one was marked as a blocker but I missed it (solved), the others is an issue with css [20:47:07] I am tempted to just postpone and go crash to bed [20:47:16] and push group1 tomorrow a bit earlier [20:47:22] then group2 an hour or so after [20:47:42] is there still an unresolved issue with the CSS? [20:49:11] yep impacting hewikis [20:49:24] MediaWiki:Common.css not being applied apparently [20:49:31] which is hmm... 
definitely a blocker :] [20:49:39] hashar: it's pretty much up to whoever is running the train, so you get to make the roll forward today or wait for tomorrow call :) [20:49:44] in the past there have been times where we've done 1 and 2 on Thursday due to some blocking issue. [20:50:03] yeah [20:50:10] I think I will just do that [20:50:17] group1 on thursday at usual time [20:50:26] and if all goes fine after an hour group2 [20:50:27] OR [20:50:48] I group2 at 8:59am CEST which technically is still "not friday in SF" [20:51:28] any mailling list(s) to recommend for me to post an announce? I guess simply wikitech is good enough [20:53:05] wikitech is probably good. I could also handle the rollout of group2 tomorrow if it's too late for you. [20:53:37] hey thanks! [20:53:47] will see tomorrow whether I can handle group 2 or not [20:53:54] * hashar writes announce to wikitech-l [20:54:20] matanya: thank you a ton to have quickly reported the infobox issue on hewiki \o/ [20:54:21] kk, I'll make sure I'm available to throw the switch :) [20:55:18] else I can add a cron that does curl --POST https://integration.wikimedia.org/ci/job/prod-scap/buildWithParameters?group=group2&force=yes&ignorerrors=yes [20:57:05] Some Day™ [20:57:44] !log RESTBase update to fd43f3a58 staging [20:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:04] thcipriani, hashar: too hard. 
just "jouncebot: promote group2" [21:00:17] totally [21:00:21] 'chatops' [21:03:41] heh, group1 and group2 days are not too far off from that command as it is: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Thursday:_group.7B0.2C1.7D_to_all_deploy [21:03:48] deploy-promote all [21:05:21] ok wrote some nonsense to wikitech [21:05:48] then after 14+ years, I guess most people already have me in their plonk list [21:07:31] (PS1) Yuvipanda: labtest: actually add httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310669 (https://phabricator.wikimedia.org/T133412) [21:08:20] (PS2) Yuvipanda: labtest: actually add httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310669 (https://phabricator.wikimedia.org/T133412) [21:08:25] (CR) Yuvipanda: [C: 2 V: 2] labtest: actually add httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310669 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda) [21:09:05] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:09:54] hashar: legoktm: I've finished the test (required a bit of mocking). I'll have a patch in 30min or so. [21:09:57] We should delay yes. [21:10:00] either to later today or tomorrow. [21:10:03] hashar: always at your service [21:13:02] !log RESTBase update to fd43f3a58 canary on restbase1007 [21:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:57] Krinkle: take your time. I don't mind delaying even more if needed :] [21:16:14] Krinkle: I am gonna crash to bed, will revisit tomorrow!
[21:18:22] !log RESTBase update to fd43f3a58 [21:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:36:35] (03Draft1) 10Paladox: puppetmaster: Fix whitespace error [puppet] - 10https://gerrit.wikimedia.org/r/310672 [21:37:09] (03PS2) 10Paladox: puppetmaster: Fix whitespace error [puppet] - 10https://gerrit.wikimedia.org/r/310672 [21:38:50] (03PS3) 10Paladox: puppetmaster: Fix whitespace error [puppet] - 10https://gerrit.wikimedia.org/r/310672 [21:43:18] (03Draft1) 10Paladox: burrow: fix optional parameter listed before required parameter [puppet] - 10https://gerrit.wikimedia.org/r/310674 [21:43:20] (03Draft2) 10Paladox: burrow: fix optional parameter listed before required parameter [puppet] - 10https://gerrit.wikimedia.org/r/310674 [21:50:57] [5e66aae686620598f2d74edf] 2016-09-14 18:06:39: Fatal exception of type "JobQueueError" undeleting on wikitech [21:51:45] hashar ^^ [21:52:45] PROBLEM - puppet last run on wtp2016 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:53:20] (PS1) Yuvipanda: labs: More fixups to htpyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310676 [21:53:34] (PS2) Yuvipanda: labs: More fixups to htpyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310676 [21:54:57] (CR) jenkins-bot: [V: -1] labs: More fixups to htpyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310676 (owner: Yuvipanda) [21:56:35] thought so [21:57:06] (PS3) Yuvipanda: labs: More fixups to htpyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310676 [21:58:39] (CR) Yuvipanda: [C: 2] labs: More fixups to htpyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310676 (owner: Yuvipanda) [21:58:47] (PS4) Yuvipanda: labs: More fixups to htpyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310676 [21:58:50] (CR) Yuvipanda: [V: 2] labs: More fixups to htpyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310676 (owner: Yuvipanda) [22:06:41] paladox: legoktm can't look at that JobQueueError on wikitech :( I am more than half asleep already :( [22:06:58] hashar: go to sleep :) I already looked into it [22:07:07] :p [22:07:41] (PS1) Yuvipanda: labs: More fixups to httpyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310678 [22:08:43] (CR) jenkins-bot: [V: -1] labs: More fixups to httpyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310678 (owner: Yuvipanda) [22:10:33] (PS2) Yuvipanda: labs: More fixups to httpyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310678 [22:13:17] (CR) Yuvipanda: [C: 2] labs: More fixups to httpyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310678 (owner: Yuvipanda) [22:16:15] Operations, Upstream: Trusty: debug information found in "/usr/lib/debug//usr/lib/php5/20121212/mysql.so" does not match "/usr/lib/php5/20121212/mysql.so" (CRC mismatch).
- https://phabricator.wikimedia.org/T145706#2638735 (10hashar) [22:16:39] (03CR) 10Ottomata: [C: 032] burrow: fix optional parameter listed before required parameter [puppet] - 10https://gerrit.wikimedia.org/r/310674 (owner: 10Paladox) [22:16:44] (03PS3) 10Ottomata: burrow: fix optional parameter listed before required parameter [puppet] - 10https://gerrit.wikimedia.org/r/310674 (owner: 10Paladox) [22:17:35] RECOVERY - puppet last run on wtp2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:19:23] legoktm: hashar: greg-g: OK to deploy fix for regression https://phabricator.wikimedia.org/T145673 to wmf.19? [22:20:05] sounds good to me [22:24:42] Krinkle: cherry pick to wmf.19 sure [22:25:14] I am sleeping really. Probably better to hold group1 [22:25:27] but cherry pick to wmf.19 so it lands on group0 , yeah that sounds good [22:26:38] and TestingAccessWrapper is quite nice :] [22:27:15] (03PS1) 10Volans: auto-reimage: improved output [puppet] - 10https://gerrit.wikimedia.org/r/310680 (https://phabricator.wikimedia.org/T143536) [22:27:27] Krinkle: https://gerrit.wikimedia.org/r/#/c/310679/1/tests/phpunit/includes/resourceloader/ResourceLoaderWikiModuleTest.php has a typo maybe [22:28:06] all I can spot for now. Sorry really gotta sleep :/ [22:28:48] (03PS2) 10Volans: auto-reimage: improved output [puppet] - 10https://gerrit.wikimedia.org/r/310680 (https://phabricator.wikimedia.org/T143536) [22:29:01] I guess that is the whole point of the patch [22:29:06] bah *vanishes* [22:29:55] <_joe_> yuvipanda: I was looking at that hiera backend [22:30:15] (03CR) 10Volans: [C: 032] auto-reimage: improved output [puppet] - 10https://gerrit.wikimedia.org/r/310680 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [22:30:21] <_joe_> you don't get any meaningful info back from the http server? Like Last-Modified? 
[22:31:17] joe: nope [22:31:43] (03PS1) 10Andrew Bogott: Puppet panel: Add a missing /div [puppet] - 10https://gerrit.wikimedia.org/r/310682 [22:31:45] (03PS1) 10Andrew Bogott: Puppet panel: Add per-prefix config tabs [puppet] - 10https://gerrit.wikimedia.org/r/310683 (https://phabricator.wikimedia.org/T91990) [22:32:51] <_joe_> yuvipanda: wtf? even a stupid apache with files would do the right thing [22:33:15] (03CR) 10jenkins-bot: [V: 04-1] Puppet panel: Add per-prefix config tabs [puppet] - 10https://gerrit.wikimedia.org/r/310683 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [22:33:24] shrug maybe I can add it later [22:34:08] joe: do I need to do anything other than restart puppetmaster for the changes to take effect? [22:34:46] <_joe_> which changes, and what puppetmaster? [22:35:20] I'm testing this on labtest [22:35:20] changes to the hiera backend [22:35:25] (such as https://gerrit.wikimedia.org/r/#/c/310678/) [22:35:50] <_joe_> yes, you need to restart the puppetmaster [22:36:03] !log krinkle@tin Synchronized php-1.28.0-wmf.19/includes/resourceloader/ResourceLoaderWikiModule.php: T145673 (duration: 00m 47s) [22:36:04] T145673: deploy group1 wikis to 1.28.0-wmf.19 broke template RTL placement - https://phabricator.wikimedia.org/T145673 [22:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:31] but just that, right? so I can cherry-pick changes onto /var/lib/git and then restart puppetmaster to test? 
[22:38:42] (03PS2) 10Andrew Bogott: Puppet panel: Add per-prefix config tabs [puppet] - 10https://gerrit.wikimedia.org/r/310683 (https://phabricator.wikimedia.org/T91990) [22:38:44] (03PS2) 10Andrew Bogott: Puppet panel: remove an extra /div [puppet] - 10https://gerrit.wikimedia.org/r/310682 [22:39:54] (03CR) 10jenkins-bot: [V: 04-1] Puppet panel: Add per-prefix config tabs [puppet] - 10https://gerrit.wikimedia.org/r/310683 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [22:41:09] (03PS3) 10Andrew Bogott: Puppet panel: Add per-prefix config tabs [puppet] - 10https://gerrit.wikimedia.org/r/310683 (https://phabricator.wikimedia.org/T91990) [22:43:12] (03PS3) 10Andrew Bogott: Puppet panel: remove an extra /div [puppet] - 10https://gerrit.wikimedia.org/r/310682 [22:43:36] (03PS4) 10Andrew Bogott: Puppet panel: Add per-prefix config tabs [puppet] - 10https://gerrit.wikimedia.org/r/310683 (https://phabricator.wikimedia.org/T91990) [22:45:50] (03CR) 10Andrew Bogott: [C: 032] Puppet panel: remove an extra /div [puppet] - 10https://gerrit.wikimedia.org/r/310682 (owner: 10Andrew Bogott) [22:46:13] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:46:21] (03CR) 10Andrew Bogott: [C: 032] Puppet panel: Add per-prefix config tabs [puppet] - 10https://gerrit.wikimedia.org/r/310683 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [22:51:12] (03PS1) 10Dzahn: tendril: fix 'optional before required parameter' [puppet] - 10https://gerrit.wikimedia.org/r/310685 [22:52:52] (03PS1) 10Yuvipanda: labs: More fixups to httpyaml hiera backend [puppet] - 10https://gerrit.wikimedia.org/r/310687 [22:53:21] (03PS2) 10Yuvipanda: labs: More fixups to httpyaml hiera backend [puppet] - 10https://gerrit.wikimedia.org/r/310687 [22:53:34] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: More fixups to httpyaml hiera backend [puppet] - 10https://gerrit.wikimedia.org/r/310687 (owner: 10Yuvipanda) [22:55:09] (03PS2) 10Dzahn: tendril: fix 'optional before required parameter' [puppet] - 10https://gerrit.wikimedia.org/r/310685 [22:55:21] (03CR) 10Dzahn: [C: 032] tendril: fix 'optional before required parameter' [puppet] - 10https://gerrit.wikimedia.org/r/310685 (owner: 10Dzahn) [22:57:49] (03PS4) 10Dzahn: puppetmaster: Fix whitespace error [puppet] - 10https://gerrit.wikimedia.org/r/310672 (owner: 10Paladox) [22:59:14] (03CR) 10Dzahn: [C: 032] puppetmaster: Fix whitespace error [puppet] - 10https://gerrit.wikimedia.org/r/310672 (owner: 10Paladox) [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160914T2300). [23:00:04] matt_flaschen and yurik: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:15] Present [23:01:24] (03PS5) 10Dzahn: snapshot: Fix variable contains an uppercase letter [puppet] - 10https://gerrit.wikimedia.org/r/308355 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [23:02:19] (03PS1) 10Yuvipanda: labs: Use YAML.load instead of .parse for loading YAML [puppet] - 10https://gerrit.wikimedia.org/r/310689 [23:02:47] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/4081/" [puppet] - 10https://gerrit.wikimedia.org/r/308355 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [23:03:08] Hello [23:03:13] I can SWAT this evening. [23:03:17] yurik: ping ? [23:03:25] Dereckson, pong [23:03:51] (03CR) 10jenkins-bot: [V: 04-1] labs: Use YAML.load instead of .parse for loading YAML [puppet] - 10https://gerrit.wikimedia.org/r/310689 (owner: 10Yuvipanda) [23:04:49] yuvipanda: is "ircyall" not used anymore? [23:05:11] was wondering about the comments on https://gerrit.wikimedia.org/r/#/c/308311/ [23:05:29] (03PS2) 10Yuvipanda: labs: Use YAML.load instead of .parse for loading YAML [puppet] - 10https://gerrit.wikimedia.org/r/310689 [23:05:45] mutante: yeah, I think you can kill the module / roles [23:05:50] I think the project was deleted recently to [23:07:04] yuvipanda: alright, thanks, will do it in 2 steps [23:07:41] (03CR) 10Yuvipanda: [C: 032] labs: Use YAML.load instead of .parse for loading YAML [puppet] - 10https://gerrit.wikimedia.org/r/310689 (owner: 10Yuvipanda) [23:07:47] (03PS2) 10Dzahn: ircyall: move role to module/role [puppet] - 10https://gerrit.wikimedia.org/r/308311 (owner: 10Hashar) [23:08:18] yurik: okay I've CR+2, we wait zuul [23:08:27] thx [23:09:04] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309346 (https://phabricator.wikimedia.org/T131957) (owner: 10Mattflaschen) [23:09:18] (03PS3) 10Dereckson: Add logging channel for NewUserMessage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309346 (https://phabricator.wikimedia.org/T131957) (owner: 
10Mattflaschen) [23:09:26] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309346 (https://phabricator.wikimedia.org/T131957) (owner: 10Mattflaschen) [23:09:42] (03CR) 10Dzahn: [C: 032] "16:05 < yuvipanda> mutante: yeah, I think you can kill the module / roles" [puppet] - 10https://gerrit.wikimedia.org/r/308311 (owner: 10Hashar) [23:09:52] (03Merged) 10jenkins-bot: Add logging channel for NewUserMessage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309346 (https://phabricator.wikimedia.org/T131957) (owner: 10Mattflaschen) [23:10:48] Krinkle: https://tendril.wikimedia.org/host/view/db1087.eqiad.wmnet/3306 better :) [23:12:22] 06Operations, 07Puppet: Puppetize ircyall & set up instance appropriately - https://phabricator.wikimedia.org/T1357#2638854 (10Dzahn) 16:05 < mutante> yuvipanda: is "ircyall" not used anymore? 16:05 < mutante> was wondering about the comments on https://gerrit.wikimedia.org/r/#/c/308311/ 16:05 < yuvipanda> mut... [23:13:14] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:14:00] matt_flaschen: live on mw1099 [23:14:30] (03PS1) 10Dzahn: delete ircyall module and role [puppet] - 10https://gerrit.wikimedia.org/r/310692 (https://phabricator.wikimedia.org/T1357) [23:16:16] (03PS1) 10Andrew Bogott: Puppet Panel: rename panel files [puppet] - 10https://gerrit.wikimedia.org/r/310693 [23:16:38] PROBLEM - puppet last run on elastic2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:18:35] (CR) Dzahn: "yep, but since the class name doesn't change the labs instances should also not be influenced at all" [puppet] - https://gerrit.wikimedia.org/r/308322 (https://phabricator.wikimedia.org/T93645) (owner: Hashar) [23:18:53] (CR) Andrew Bogott: [C: 2] Puppet Panel: rename panel files [puppet] - https://gerrit.wikimedia.org/r/310693 (owner: Andrew Bogott) [23:19:12] (PS2) Dzahn: role: move mediawiki::install to autoloader layout [puppet] - https://gerrit.wikimedia.org/r/308322 (https://phabricator.wikimedia.org/T93645) (owner: Hashar) [23:23:16] (CR) Dzahn: [C: 2] role: move mediawiki::install to autoloader layout [puppet] - https://gerrit.wikimedia.org/r/308322 (https://phabricator.wikimedia.org/T93645) (owner: Hashar) [23:24:21] Dereckson, works on mw1099, thanks. [23:31:20] matt_flaschen: ack'ed [23:31:56] I was looking about 3 Invalid parameter for message "logentry-massmessage-failure": a:1:{i:0;s:13:"edit-conflict";} in /srv/mediawiki/php-1.28.0-wmf.18/inc [23:32:00] ludes/Message.php on line 1114 [23:32:16] but that doesn't seem to be related to your change [23:32:34] Dereckson, no, shouldn't be related. [23:32:38] that's a really old bug [23:32:50] we fixed the source, but if someone views a bad log entry it'll become a warning again [23:32:55] occur when MassMessage extension edits a page someone else is editing?
[23:33:03] oh [23:33:38] there's a bug for it somewhere [23:33:46] we should write a cleanup script soon(TM) [23:33:47] * legoktm afk [23:34:40] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add logging channel for NewUserMessage (T131957) (duration: 00m 47s) [23:34:41] T131957: NewUserMessage should handle Flow properly; affects gomwiki/Konkani Wikipedia and kabwiki - https://phabricator.wikimedia.org/T131957 [23:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:35:26] (03CR) 10Dzahn: "of those instances in the watroles link i tried 4, first one had an unrelated puppet error (but this change is no-op), 2 were "No route to" [puppet] - 10https://gerrit.wikimedia.org/r/308322 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [23:38:01] Thanks, Dereckson [23:38:23] Testing now [23:38:24] You're welcome. [23:38:30] yurik: live on mw1099 [23:38:36] checking [23:40:27] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2638931 (10Paladox) [23:40:46] Dereckson, verified on terbium. [23:42:15] ok [23:43:03] Dereckson, all good [23:43:24] (03CR) 10Dzahn: "tried 2 more. 
1 x "packet_write_wait: Connection to UNKNOWN port 0: Broken pipe" and 1 x "channel 0: open failed: connect failed: No route" [puppet] - 10https://gerrit.wikimedia.org/r/308322 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [23:43:37] okay, syncing [23:43:50] RECOVERY - puppet last run on elastic2002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [23:43:53] !log dereckson@tin Synchronized php-1.28.0-wmf.19/extensions/Kartographer/includes/Tag/TagHandler.php: Always serve all the data on preview (T145615, 1/2) (duration: 00m 47s) [23:43:54] T145615: Preview with does not work - https://phabricator.wikimedia.org/T145615 [23:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:44:43] !log dereckson@tin Synchronized php-1.28.0-wmf.19/extensions/Kartographer/tests/phpunit/KartographerTest.php: Always serve all the data on preview (T145615, 2/2, no-op part) (duration: 00m 50s) [23:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:45:15] yurik: here you are ^ [23:45:37] thx! 
[23:45:56] (CR) Paladox: [C: 1] delete ircyall module and role [puppet] - https://gerrit.wikimedia.org/r/310692 (https://phabricator.wikimedia.org/T1357) (owner: Dzahn) [23:47:53] (Draft1) Paladox: mariadb: Use require_package instead of package latest [puppet] - https://gerrit.wikimedia.org/r/310702 (https://phabricator.wikimedia.org/T115348) [23:48:30] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2638957 (Dzahn) [23:48:46] (PS2) Paladox: mariadb: Use require_package instead of package latest [puppet] - https://gerrit.wikimedia.org/r/310702 (https://phabricator.wikimedia.org/T115348) [23:49:21] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1722037 (Dzahn) [23:50:08] (CR) Dzahn: [C: 1] mariadb: Use require_package instead of package latest [puppet] - https://gerrit.wikimedia.org/r/310702 (https://phabricator.wikimedia.org/T115348) (owner: Paladox) [23:50:08] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2638961 (Paladox) [23:50:22] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [23:50:41] (CR) Dzahn: "This is part of the "Audit uses of package=>latest" ticket" [puppet] - https://gerrit.wikimedia.org/r/310702 (https://phabricator.wikimedia.org/T115348) (owner: Paladox) [23:52:35] (Draft1) Paladox: mediawiki_singlenode: Use require_package instead of package latest [puppet] - https://gerrit.wikimedia.org/r/310703 (https://phabricator.wikimedia.org/T115348) [23:52:37] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2638965 (Dzahn) [23:53:24] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2638967 (Dzahn) [23:53:48] (PS2) Paladox: mediawiki_singlenode: Use require_package instead of package latest [puppet] - https://gerrit.wikimedia.org/r/310703 (https://phabricator.wikimedia.org/T115348) [23:55:14] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2638969 (Paladox) [23:58:27] (PS1) Dzahn: contint: don't use ensure 'latest' with php packages [puppet] - https://gerrit.wikimedia.org/r/310704 (https://phabricator.wikimedia.org/T115348)