[00:27:15] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:50:59] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [00:53:52] Operations, Prod-Kubernetes, Kubernetes-production-experiment, User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2704675 (yuvipanda) For tools at least, I'd like to continue using upstream's packages. We don't have the bandwidth to do this backp... [01:28:30] Operations, Continuous-Integration-Infrastructure, Traffic, Regression: Favicon broken on doc.wikimedia.org and integration.wikimedia.org (HTTP 500) - https://phabricator.wikimedia.org/T147814#2704346 (BBlack) The response lacks `Content-Length` because it's sent with `Transfer-Encoding: chunked`... [02:34:03] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.21) (duration: 15m 28s) [02:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:38:48] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Oct 11 02:38:48 UTC 2016 (duration 4m 45s) [02:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:40:57] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [02:46:38] PROBLEM - thumbor@8840 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8840 is inactive [02:48:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [03:01:59] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:28:11] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:18:36] PROBLEM - HP RAID on ms-be1022 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [05:21:07] RECOVERY - HP RAID on ms-be1022 is OK: OK: Slot 3: OK: 2I:4:2, 2I:4:1, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [05:22:37] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Stirring The Pot, and 2 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2704815 (AndyRussG) I've been able to trigger this error a few times on the beta cluster by creating a... [06:10:24] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2704822 (Joe) >>! In T147718#2703679, @Ottomata wrote: > - Does this mean that a single role can no longer be used by both labs and production? No, b... [06:17:14] !log Deploying schema change S4 commonswiki.revision - T147305 [06:17:15] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [06:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:31:32] !Dropping memory tables hitcounter, _counters from S4 hosts - T132837 [06:31:32] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837 [06:35:27] PROBLEM - HP RAID on ms-be1022 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[06:38:06] RECOVERY - HP RAID on ms-be1022 is OK: OK: Slot 3: OK: 2I:4:2, 2I:4:1, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [06:40:56] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ack-grep] [06:41:53] Operations, Prod-Kubernetes, Kubernetes-production-experiment, User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2704840 (Joe) @yuvipanda I'm ok with that (well, not really ok, but it's not my call), but then please let's rename `docker::engine... [06:44:32] (CR) Giuseppe Lavagetto: [C: 1] "LGTM but please check with your team if they use X-Powered-By anywhere; also check our VCL for remnants from the HHVM migration." [puppet] - https://gerrit.wikimedia.org/r/314519 (owner: Elukey) [06:50:29] !log installing django security updates [06:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:54:15] (PS8) Giuseppe Lavagetto: hiera: always search for the full key [puppet] - https://gerrit.wikimedia.org/r/312206 (https://phabricator.wikimedia.org/T147403) [06:55:43] (CR) Giuseppe Lavagetto: [C: 2] hiera: always search for the full key [puppet] - https://gerrit.wikimedia.org/r/312206 (https://phabricator.wikimedia.org/T147403) (owner: Giuseppe Lavagetto) [07:01:42] Puppet, Beta-Cluster-Infrastructure: puppet failure on deployment-phab0[12] due to missing expected puppet:///modules/phabricator/sshd-phab.service - https://phabricator.wikimedia.org/T147818#2704869 (hashar) @mmodell what are those deployment-phab01 and deployment-phab02 instances? From the name that se... 
[07:07:39] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:14:02] (CR) Elukey: "The only comment received for the moment is related to the hiera_array() call for the extended_option array: we could instead hardcode a f" [puppet] - https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) (owner: Elukey) [07:18:44] (CR) Jcrespo: [C: -1] "I do not think this is something we should do right now, if someone searches for "C programming", I think it is ok to only return results " [puppet] - https://gerrit.wikimedia.org/r/315057 (https://phabricator.wikimedia.org/T146673) (owner: Paladox) [07:21:29] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:23:52] moritzm: should we do the tin.eqiad.wmnet reimage this morning? [07:33:22] Operations, Prod-Kubernetes, Kubernetes-production-experiment, User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2704917 (Joe) Back to the main topic: docker 1.11 seems to need a newer protobuf go library than the one we have on jessie/backports... 
[07:34:09] (PS1) Giuseppe Lavagetto: hiera: convert expand_path hierarchies to use full key [puppet] - https://gerrit.wikimedia.org/r/315200 (https://phabricator.wikimedia.org/T147403) [07:36:38] hashar: let's take that to -releng [07:39:10] (PS1) Alexandros Kosiaris: gallium: Open up the rsync port from contint1001 [puppet] - https://gerrit.wikimedia.org/r/315201 [07:40:31] hashar: ^ [07:41:20] (PS1) Giuseppe Lavagetto: hiera: complete transition in nuyaml backend [puppet] - https://gerrit.wikimedia.org/r/315202 (https://phabricator.wikimedia.org/T147403) [07:44:09] (CR) Hashar: [C: 1] "Great, thank you :]" [puppet] - https://gerrit.wikimedia.org/r/315201 (owner: Alexandros Kosiaris) [07:45:39] (PS1) Giuseppe Lavagetto: Convert hiera to the form expected by the new backend [labs/private] - https://gerrit.wikimedia.org/r/315204 (https://phabricator.wikimedia.org/T147403) [07:45:44] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [07:54:05] (PS1) Hashar: Switch primary deployment server from tin to mira [puppet] - https://gerrit.wikimedia.org/r/315205 (https://phabricator.wikimedia.org/T144578) [07:56:07] !log reimaging mw1017 to jessie (test application server in eqiad) [07:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:57:04] (CR) 20after4: [C: 1] Scap: modify deploy-local arguments [puppet] - https://gerrit.wikimedia.org/r/315139 (https://phabricator.wikimedia.org/T146602) (owner: Thcipriani) [08:05:48] Operations, codfw-rollout: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#2704933 (akosiaris) Open>Resolved This has now been resolved [08:09:55] Operations, Puppet, Goal, Puppet-infrastructure-modernization: Goal: Modernize puppet configuration management infrastructure - https://phabricator.wikimedia.org/T139471#2704951 (akosiaris) 
Open>Resolved a:akosiaris The goal has been achieved, resolving this. [08:10:06] Operations, Puppet, Goal, Puppet-infrastructure-modernization: Goal: Modernize puppet configuration management infrastructure - https://phabricator.wikimedia.org/T139471#2704954 (akosiaris) [08:10:21] Operations: Install puppetDB at WMF - https://phabricator.wikimedia.org/T139476#2704955 (akosiaris) [08:10:41] Operations, Puppet, Goal, Puppet-infrastructure-modernization: Goal: Modernize puppet configuration management infrastructure - https://phabricator.wikimedia.org/T139471#2433318 (akosiaris) [08:10:43] Operations: Install puppetDB at WMF - https://phabricator.wikimedia.org/T139476#2433419 (akosiaris) Open>Resolved a:akosiaris PuppetDB has been installed in WMF, works fine, resolving [08:10:52] !Dropping memory tables hitcounter, _counters from S5 master (db1049) - T132837 [08:10:52] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837 [08:11:01] Operations, hardware-requests: EQIAD: (2) hardware access request for PUPPET - https://phabricator.wikimedia.org/T142218#2704960 (akosiaris) Open>Resolved a:akosiaris Resolving. [08:11:27] Operations, hardware-requests: EQIAD: (2) hardware access request for PUPPET - https://phabricator.wikimedia.org/T142218#2704967 (akosiaris) [08:12:18] Operations, Puppet, Monitoring, Patch-For-Review: Puppet agent icinga checks need better logic - https://phabricator.wikimedia.org/T143099#2704968 (akosiaris) Open>Resolved a:akosiaris The problem stated in this task has been fixed by https://gerrit.wikimedia.org/r/305630. 
Resolving [08:14:14] (PS1) Alexandros Kosiaris: icinga: Fix ocg group monitoring name [puppet] - https://gerrit.wikimedia.org/r/315207 [08:14:52] (CR) jenkins-bot: [V: -1] icinga: Fix ocg group monitoring name [puppet] - https://gerrit.wikimedia.org/r/315207 (owner: Alexandros Kosiaris) [08:15:00] (PS3) Alexandros Kosiaris: icinga: Remove event_profiling_enabled [puppet] - https://gerrit.wikimedia.org/r/315085 [08:15:02] (PS2) Alexandros Kosiaris: icinga: Fix ocg group monitoring name [puppet] - https://gerrit.wikimedia.org/r/315207 [08:15:02] akosiaris: I broke it sorry [08:15:04] (PS3) Alexandros Kosiaris: icinga: retry_check_interval => retry_interval [puppet] - https://gerrit.wikimedia.org/r/315087 [08:15:06] (PS3) Alexandros Kosiaris: icinga: normal_check_interval => check_interval [puppet] - https://gerrit.wikimedia.org/r/315086 [08:15:08] (PS2) Alexandros Kosiaris: icinga: Kill hostextinfo [puppet] - https://gerrit.wikimedia.org/r/315090 [08:15:25] (CR) jenkins-bot: [V: -1] icinga: Remove event_profiling_enabled [puppet] - https://gerrit.wikimedia.org/r/315085 (owner: Alexandros Kosiaris) [08:15:31] (CR) jenkins-bot: [V: -1] icinga: Fix ocg group monitoring name [puppet] - https://gerrit.wikimedia.org/r/315207 (owner: Alexandros Kosiaris) [08:16:00] (CR) jenkins-bot: [V: -1] icinga: retry_check_interval => retry_interval [puppet] - https://gerrit.wikimedia.org/r/315087 (owner: Alexandros Kosiaris) [08:16:25] Operations: Make Puppet run NICEd on all servers - https://phabricator.wikimedia.org/T78848#2704973 (akosiaris) Open>declined Despite all the history in this task, we 've had the capability to enable this for more than 2 years now and seem to not really care. Hence I 'll resolve as declined. Feel fre... 
[08:16:37] (CR) jenkins-bot: [V: -1] icinga: normal_check_interval => check_interval [puppet] - https://gerrit.wikimedia.org/r/315086 (owner: Alexandros Kosiaris) [08:16:49] (CR) jenkins-bot: [V: -1] icinga: Kill hostextinfo [puppet] - https://gerrit.wikimedia.org/r/315090 (owner: Alexandros Kosiaris) [08:17:39] Operations, Patch-For-Review: turn lldp info into puppet facts, mention in MOTD - https://phabricator.wikimedia.org/T84518#2704976 (akosiaris) Open>Resolved a:akosiaris Done for quite some time now in various steps. Resolving [08:18:17] Operations, Puppet, Need-volunteer: MaxClients on puppetmaster - https://phabricator.wikimedia.org/T97466#2704980 (akosiaris) Open>declined Declined for now. If we ever get that problem again with the new infrastructure, we can revisit this [08:18:27] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/315207 (owner: Alexandros Kosiaris) [08:18:52] heh [08:19:13] I broke the CI part that does the merge commits bah :(( [08:19:50] ERROR: content conflict in modules/monitoring/manifests/group.pp [08:19:54] does not make any sense to me :( [08:20:08] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2704983 (Gilles) [08:20:10] Operations, Performance-Team, Thumbor: thumbor: Some video files not recognized - https://phabricator.wikimedia.org/T147417#2704982 (Gilles) Open>Resolved [08:22:01] Operations, RESTBase, Services, Patch-For-Review, User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2353511 (akosiaris) I think the log passthrough problem has been resolved as well and now services get a syslog.log file as well next to the main l... 
[08:22:09] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/315207 (owner: Alexandros Kosiaris) [08:23:47] (PS1) Gilles: Increase Thumbor HTTP_LOADER_REQUEST_TIMEOUT to 2 minutes [puppet] - https://gerrit.wikimedia.org/r/315208 [08:25:24] akosiaris: can you try rebasing your series? [08:26:51] PROBLEM - HP RAID on ms-be1022 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [08:28:05] (PS4) Alexandros Kosiaris: icinga: Remove event_profiling_enabled [puppet] - https://gerrit.wikimedia.org/r/315085 [08:28:08] (PS3) Alexandros Kosiaris: icinga: Fix ocg group monitoring name [puppet] - https://gerrit.wikimedia.org/r/315207 [08:28:10] (PS4) Alexandros Kosiaris: icinga: retry_check_interval => retry_interval [puppet] - https://gerrit.wikimedia.org/r/315087 [08:28:11] (PS4) Alexandros Kosiaris: icinga: normal_check_interval => check_interval [puppet] - https://gerrit.wikimedia.org/r/315086 [08:28:12] hashar: let's see [08:28:14] (PS3) Alexandros Kosiaris: icinga: Kill hostextinfo [puppet] - https://gerrit.wikimedia.org/r/315090 [08:28:41] akosiaris: maybe there was actually a merge conflict of some sort :] [08:28:46] changes are processing [08:29:31] RECOVERY - HP RAID on ms-be1022 is OK: OK: Slot 3: OK: 2I:4:2, 2I:4:1, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [08:31:11] Operations, RESTBase, Services, User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2704994 (mobrovac) Open>Resolved a:mobrovac Yup. We haven't seen this happening since upgrading firejail and pushing logs to a file. Closing. 
[08:31:50] !log Removing Not needed file from dbstore1001 to free up space (/srv/tmp/db1064.tar.gz.enc) [08:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:40:41] (PS3) Muehlenhoff: Generate stats for monthly package upgrade activity [puppet] - https://gerrit.wikimedia.org/r/303531 (https://phabricator.wikimedia.org/T116742) [08:41:41] (CR) jenkins-bot: [V: -1] Generate stats for monthly package upgrade activity [puppet] - https://gerrit.wikimedia.org/r/303531 (https://phabricator.wikimedia.org/T116742) (owner: Muehlenhoff) [08:42:09] Operations, netops, Goal: Decomission palladium - https://phabricator.wikimedia.org/T147320#2705007 (akosiaris) [08:44:16] (CR) Alexandros Kosiaris: [C: 2] gallium: Open up the rsync port from contint1001 [puppet] - https://gerrit.wikimedia.org/r/315201 (owner: Alexandros Kosiaris) [08:44:20] (PS2) Alexandros Kosiaris: gallium: Open up the rsync port from contint1001 [puppet] - https://gerrit.wikimedia.org/r/315201 [08:44:23] (CR) Alexandros Kosiaris: [V: 2] gallium: Open up the rsync port from contint1001 [puppet] - https://gerrit.wikimedia.org/r/315201 (owner: Alexandros Kosiaris) [08:44:33] (CR) Alexandros Kosiaris: [C: 2] icinga: Fix ocg group monitoring name [puppet] - https://gerrit.wikimedia.org/r/315207 (owner: Alexandros Kosiaris) [08:44:37] (PS4) Alexandros Kosiaris: icinga: Fix ocg group monitoring name [puppet] - https://gerrit.wikimedia.org/r/315207 [08:44:39] (CR) Alexandros Kosiaris: [V: 2] icinga: Fix ocg group monitoring name [puppet] - https://gerrit.wikimedia.org/r/315207 (owner: Alexandros Kosiaris) [08:45:38] (PS2) Filippo Giunchedi: Increase Thumbor HTTP_LOADER_REQUEST_TIMEOUT to 2 minutes [puppet] - https://gerrit.wikimedia.org/r/315208 (owner: Gilles) [08:47:39] (CR) Filippo Giunchedi: [C: 2] Increase Thumbor HTTP_LOADER_REQUEST_TIMEOUT to 2 minutes [puppet] - 
https://gerrit.wikimedia.org/r/315208 (owner: Gilles) [08:47:52] Operations, Continuous-Integration-Infrastructure, puppet-compiler, Patch-For-Review: OSError: [Errno 28] No space left on device on compiler02.puppet3-diffs.eqiad.wmflabs - https://phabricator.wikimedia.org/T143671#2705015 (hashar) Open>Resolved [08:56:06] PROBLEM - thumbor@8819 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8819 is inactive [08:57:25] ^ taking a look [08:58:46] (PS1) Hashar: nodepool: lower throttling rate to OpenStack API [puppet] - https://gerrit.wikimedia.org/r/315214 [08:59:57] RECOVERY - thumbor@8806 service on thumbor1002 is OK: OK - thumbor@8806 is active [09:00:04] akosiaris: Dear anthropoid, the time has come. Please deploy OTRS upgrade to 5.0.13 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161011T0900). [09:00:05] akosiaris: A patch you scheduled for OTRS upgrade to 5.0.13 is about to be deployed. Please be available during the process. 
[09:00:35] (CR) Alexandros Kosiaris: [C: 2] icinga: Kill hostextinfo [puppet] - https://gerrit.wikimedia.org/r/315090 (owner: Alexandros Kosiaris) [09:00:39] (PS4) Alexandros Kosiaris: icinga: Kill hostextinfo [puppet] - https://gerrit.wikimedia.org/r/315090 [09:00:50] (CR) Alexandros Kosiaris: [V: 2] icinga: Kill hostextinfo [puppet] - https://gerrit.wikimedia.org/r/315090 (owner: Alexandros Kosiaris) [09:01:22] aha [09:01:26] RECOVERY - thumbor@8819 service on thumbor1002 is OK: OK - thumbor@8819 is active [09:02:06] RECOVERY - thumbor@8840 service on thumbor1002 is OK: OK - thumbor@8840 is active [09:02:31] (PS1) Muehlenhoff: Update to 4.4.24 [debs/linux44] - https://gerrit.wikimedia.org/r/315215 [09:03:05] (CR) Filippo Giunchedi: [C: 2] Upgrade to 0.1.25 (1 comment) [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/315088 (owner: Gilles) [09:03:58] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/4265/ shows this to be a noop." [puppet] - https://gerrit.wikimedia.org/r/315200 (https://phabricator.wikimedia.org/T147403) (owner: Giuseppe Lavagetto) [09:05:21] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:05:21] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:05:21] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:05:58] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:09] PROBLEM - puppet last run on lvs1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:29] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:06:29] PROBLEM - puppet last run on wtp2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:30] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:31] we'll need to fetch the puppet fail alert shower umbrella [09:06:35] _joe_: ^ [09:06:36] akosiaris, _joe_ puppermaster restarted? [09:06:37] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:47] PROBLEM - puppet last run on mw2175 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:59] PROBLEM - puppet last run on mw2159 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:08] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:08] PROBLEM - puppet last run on mw2230 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:08] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:08] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:18] PROBLEM - puppet last run on ganeti2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:20] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:21] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:07:21] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:21] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:21] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:21] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:22] <_joe_> godog: I didn't submit the patch [09:07:23] wat ? [09:07:30] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:38] PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:43] <_joe_> I'm looking [09:07:44] sigh [09:07:46] I am reverting [09:07:57] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter image on Nagios_host[elastic1031] at /etc/puppet/modules/monitoring/manifests/host.pp:68 on node elastic1031.eqiad.wmnet [09:07:59] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:08:00] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:08:01] akosiaris: [09:08:08] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:08:10] <_joe_> volans: he's on it [09:08:20] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:08:21] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:08:23] saw after pasting :( [09:08:24] ah [09:08:28] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:08:51] <_joe_> meh I'll wait before merging my change [09:09:25] (PS1) Alexandros Kosiaris: Revert "icinga: Kill hostextinfo" [puppet] - https://gerrit.wikimedia.org/r/315216 [09:10:18] (CR) Alexandros Kosiaris: [C: 2 V: 2] "Turns out it caused a havoc of Error 400 on SERVER: Invalid parameter image on Nagios_host" [puppet] - https://gerrit.wikimedia.org/r/315216 (owner: Alexandros Kosiaris) [09:10:50] I 'll have to revisit this [09:12:21] ah yes.. I am an idiot [09:13:03] <_joe_> akosiaris: if you do that, please do it also on naggen2 [09:13:22] you mean kill hostextinfo ? [09:13:29] yes but on a later patchset [09:13:34] <_joe_> ok [09:13:47] <_joe_> so, can I merge my insanely dangerous patch too? [09:13:55] I 'd wait [09:15:06] (PS1) Alexandros Kosiaris: Revert "Revert "icinga: Kill hostextinfo"" [puppet] - https://gerrit.wikimedia.org/r/315217 [09:28:39] (PS1) Ema: cache_text frontend VCL: backend_fetch vs misspass [puppet] - https://gerrit.wikimedia.org/r/315219 (https://phabricator.wikimedia.org/T131503) [09:28:50] (CR) Hashar: "Wikitech doc (outdated?) 
https://wikitech.wikimedia.org/wiki/Switch_Datacenter/DeploymentServer" [puppet] - https://gerrit.wikimedia.org/r/315205 (https://phabricator.wikimedia.org/T144578) (owner: Hashar) [09:32:10] (CR) Filippo Giunchedi: [C: -1] "LGTM overall, some suggestions" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/315062 (owner: Gilles) [09:33:52] RECOVERY - puppet last run on logstash1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:33:52] RECOVERY - puppet last run on elastic2015 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:33:52] RECOVERY - puppet last run on mc2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:33:52] RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:33:52] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [09:33:53] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:33:53] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:33:54] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:33:54] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:33:57] RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:33:58] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:33:59] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:33:59] RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 50 seconds 
ago with 0 failures [09:34:00] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:01] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:08] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [09:34:08] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:08] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:08] RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:34:08] RECOVERY - puppet last run on ms-be2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:08] RECOVERY - puppet last run on mw2205 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:32] RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [09:34:32] RECOVERY - puppet last run on mw2175 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:34:32] RECOVERY - puppet last run on mw2128 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:34:33] RECOVERY - puppet last run on db1038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:33] RECOVERY - puppet last run on mw2115 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:33] RECOVERY - puppet last run on mw2239 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [09:34:33] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:34:34] RECOVERY - puppet last run on kafka1022 is OK: OK: 
Puppet is currently enabled, last run 25 seconds ago with 0 failures [09:34:44] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [09:34:52] (CR) Volans: "A couple of comments inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/303531 (https://phabricator.wikimedia.org/T116742) (owner: Muehlenhoff) [09:34:53] RECOVERY - puppet last run on mc2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:53] RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [09:34:53] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:53] RECOVERY - puppet last run on mw2097 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:34:54] RECOVERY - puppet last run on mw2215 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:34:54] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:54] RECOVERY - puppet last run on prometheus1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:35:03] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:35:03] RECOVERY - puppet last run on elastic2023 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:35:03] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:35:04] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:35:12] RECOVERY - puppet last run on es2011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:35:13] RECOVERY - puppet last run on mw2078 is OK: OK: 
Puppet is currently enabled, last run 1 second ago with 0 failures [09:35:13] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:35:13] RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:35:14] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:35:14] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [09:35:15] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:35:15] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:35:16] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:35:23] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:35:23] RECOVERY - puppet last run on db1080 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:35:23] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:35:24] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [09:35:24] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:35:24] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [09:35:27] (PS4) Muehlenhoff: Generate stats for monthly package upgrade activity [puppet] - https://gerrit.wikimedia.org/r/303531 (https://phabricator.wikimedia.org/T116742) [09:35:33] RECOVERY - puppet last run on mw2159 is OK: OK: Puppet is 
currently enabled, last run 3 minutes ago with 0 failures [09:35:33] RECOVERY - puppet last run on elastic2012 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [09:35:34] RECOVERY - puppet last run on mc1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:35:34] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [09:35:34] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [09:35:34] RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:35:43] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [09:35:43] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:35:43] RECOVERY - puppet last run on lvs1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:35:46] RECOVERY - puppet last run on mc2011 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [09:35:46] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:35:46] RECOVERY - puppet last run on rdb2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:35:46] RECOVERY - puppet last run on mw2180 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:35:46] RECOVERY - puppet last run on wtp2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:35:46] RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [09:35:53] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:35:53] RECOVERY 
- puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:35:55] RECOVERY - puppet last run on mw2017 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [09:35:56] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:36:02] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:36:02] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:36:02] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:03] RECOVERY - puppet last run on mw2124 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:36:03] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:03] RECOVERY - puppet last run on mw2195 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:36:03] RECOVERY - puppet last run on pybal-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:04] RECOVERY - puppet last run on scandium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:05] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [09:36:05] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:36:06] RECOVERY - puppet last run on ganeti1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:06] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:36:13] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last 
run 1 second ago with 0 failures [09:36:13] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:14] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [09:36:15] RECOVERY - puppet last run on mw2213 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:36:27] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:36:27] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:32] RECOVERY - puppet last run on db1085 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [09:36:32] RECOVERY - puppet last run on lvs1012 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [09:36:32] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:36:32] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:33] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:36:33] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:36:33] RECOVERY - puppet last run on radon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:33] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [09:36:34] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:43] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:36:43] RECOVERY - puppet last run on 
mw1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:44] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:36:45] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:45] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:36:45] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:36:53] RECOVERY - puppet last run on mw1167 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:36:53] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:36:54] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [09:36:54] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:55] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [09:36:55] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:55] RECOVERY - puppet last run on db1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:56] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:37:02] RECOVERY - puppet last run on db1087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:37:04] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:37:04] RECOVERY - puppet last run on elastic2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [09:37:04] RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:37:04] RECOVERY - puppet last run on hassaleh is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:13] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:37:13] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:37:13] RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:13] RECOVERY - puppet last run on prometheus1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:13] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:37:14] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:37:14] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:37:23] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:37:23] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:37:24] RECOVERY - puppet last run on restbase-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:37:24] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:37:24] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:37:24] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:37:25] RECOVERY - puppet last run on 
mw1209 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:25] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:25] RECOVERY - puppet last run on mw2178 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [09:37:32] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [09:37:33] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [09:37:34] RECOVERY - puppet last run on wtp2013 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:37:36] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [09:37:36] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:36] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:37:36] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:37] RECOVERY - puppet last run on db1075 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [09:37:42] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:37:42] RECOVERY - puppet last run on es1015 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:37:42] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:43] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:37:43] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 
failures [09:37:43] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:43] RECOVERY - puppet last run on mw2230 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:44] RECOVERY - puppet last run on radium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:37:54] RECOVERY - puppet last run on wtp2018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:55] RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:37:55] RECOVERY - puppet last run on mw2224 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [09:37:55] RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:37:56] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:37:56] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:37:56] RECOVERY - puppet last run on db1052 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:37:56] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:38:03] RECOVERY - puppet last run on dbstore2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:38:03] RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:38:03] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:38:03] RECOVERY - puppet last run on mc1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:38:03] RECOVERY - puppet last run on ms-be2023 is OK: OK: 
Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:38:03] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:38:03] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:38:04] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [09:38:04] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:38:13] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:38:13] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:38:13] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:38:13] RECOVERY - puppet last run on mw2241 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [09:38:13] RECOVERY - puppet last run on wtp2012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:38:14] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:39:08] (03CR) 10Mobrovac: [C: 04-1] "Minor comments in-lined, otherwise LGTM" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/315122 (owner: 10Elukey) [09:40:33] (03CR) 10Ema: [C: 032] cache_text frontend VCL: backend_fetch vs misspass [puppet] - 10https://gerrit.wikimedia.org/r/315219 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [09:41:57] 07Puppet, 10Beta-Cluster-Infrastructure: puppet failure on deployment-phab0[12] due to missing expected puppet:///modules/phabricator/sshd-phab.service - https://phabricator.wikimedia.org/T147818#2705071 (10mmodell) @hashar: They are not for hosting phabricator per se, but 
rather for testing scap deployment of... [09:42:18] (03PS1) 10Muehlenhoff: Point deployment servers to mira [dns] - 10https://gerrit.wikimedia.org/r/315221 [09:44:00] (03PS5) 10Muehlenhoff: Generate stats for monthly package upgrade activity [puppet] - 10https://gerrit.wikimedia.org/r/303531 (https://phabricator.wikimedia.org/T116742) [09:45:29] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2705080 (10akosiaris) 05Open>03Resolved The upgrade has been successful, we are now running 5.0.13. Resolving this [09:45:46] (03PS1) 10Elukey: Set SyslogIdentifier in Pivot's systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/315222 (https://phabricator.wikimedia.org/T138262) [09:45:59] (03PS5) 10Gilles: Make thumbor use a temp folder controlled by systemd-tmpfiles instead of /tmp [puppet] - 10https://gerrit.wikimedia.org/r/315062 [09:46:13] (03CR) 10Gilles: Make thumbor use a temp folder controlled by systemd-tmpfiles instead of /tmp (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/315062 (owner: 10Gilles) [09:47:27] <_joe_> akosiaris: I can go on with my changes I guess [09:47:42] (03PS2) 10Giuseppe Lavagetto: hiera: convert expand_path hierarchies to use full key [puppet] - 10https://gerrit.wikimedia.org/r/315200 (https://phabricator.wikimedia.org/T147403) [09:47:48] (03CR) 10Elukey: [C: 032] Set SyslogIdentifier in Pivot's systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/315222 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [09:48:56] (03PS2) 10Hashar: Switch primary deployment server from tin to mira [puppet] - 10https://gerrit.wikimedia.org/r/315205 (https://phabricator.wikimedia.org/T144578) [09:49:30] _joe_: yes, feel free [09:54:18] (03CR) 10Muehlenhoff: [C: 031] Switch primary deployment server from tin to mira [puppet] - 10https://gerrit.wikimedia.org/r/315205 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [10:01:38] (03PS5) 10Elukey: Add documentation to systemd::syslog 
and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 [10:01:53] !log switching deployment server to mira [10:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:02:09] (03CR) 10Volans: [C: 031] "Besides the 2 inline comments added previously LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/303531 (https://phabricator.wikimedia.org/T116742) (owner: 10Muehlenhoff) [10:02:55] (03PS2) 10Muehlenhoff: Point deployment servers to mira [dns] - 10https://gerrit.wikimedia.org/r/315221 [10:04:53] (03PS3) 10Giuseppe Lavagetto: hiera: convert expand_path hierarchies to use full key [puppet] - 10https://gerrit.wikimedia.org/r/315200 (https://phabricator.wikimedia.org/T147403) [10:05:10] (03CR) 10Muehlenhoff: [C: 032] Point deployment servers to mira [dns] - 10https://gerrit.wikimedia.org/r/315221 (owner: 10Muehlenhoff) [10:13:18] (03PS3) 10Hashar: Switch primary deployment server from tin to mira [puppet] - 10https://gerrit.wikimedia.org/r/315205 (https://phabricator.wikimedia.org/T144578) [10:13:46] (03PS3) 10Elukey: Remove old wikistats cron script causing cron-spam [puppet] - 10https://gerrit.wikimedia.org/r/315101 (https://phabricator.wikimedia.org/T145606) [10:15:06] (03CR) 10Elukey: "I'll follow up with another patch to add pigz to standard_packages" [puppet] - 10https://gerrit.wikimedia.org/r/315101 (https://phabricator.wikimedia.org/T145606) (owner: 10Elukey) [10:15:14] (03PS4) 10Muehlenhoff: Switch primary deployment server from tin to mira [puppet] - 10https://gerrit.wikimedia.org/r/315205 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [10:16:05] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:16:54] (03PS6) 10Elukey: Revert "Remove kafka1012 from EventLogging brokers array" [puppet] - 10https://gerrit.wikimedia.org/r/315106 [10:17:43] (03Abandoned) 10Mark Bergsma: add wikilovesmonument.org [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) (owner: 10JanZerebecki) [10:19:56] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [10:22:27] (03CR) 10Elukey: [C: 032] Revert "Remove kafka1012 from EventLogging brokers array" [puppet] - 10https://gerrit.wikimedia.org/r/315106 (owner: 10Elukey) [10:22:33] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [10:26:20] (03CR) 10Muehlenhoff: [C: 032] Switch primary deployment server from tin to mira [puppet] - 10https://gerrit.wikimedia.org/r/315205 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [10:26:24] (03PS5) 10Muehlenhoff: Switch primary deployment server from tin to mira [puppet] - 10https://gerrit.wikimedia.org/r/315205 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [10:28:24] (03CR) 10Mobrovac: [C: 04-1] "One more nit, and we're good. Sorry to have missed it earlier." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/315122 (owner: 10Elukey) [10:29:14] ah snap you are right Marko [10:37:34] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:38:06] (03PS6) 10Elukey: Add documentation to systemd::syslog and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 [10:39:27] (03CR) 10Mobrovac: "LGTM, not +1'ing due to the scap pkg dependency." 
[puppet] - 10https://gerrit.wikimedia.org/r/315139 (https://phabricator.wikimedia.org/T146602) (owner: 10Thcipriani) [10:39:35] (03PS7) 10Elukey: Add documentation to systemd::syslog and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 [10:40:14] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [10:42:46] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: convert expand_path hierarchies to use full key [puppet] - 10https://gerrit.wikimedia.org/r/315200 (https://phabricator.wikimedia.org/T147403) (owner: 10Giuseppe Lavagetto) [10:42:55] (03PS4) 10Giuseppe Lavagetto: hiera: convert expand_path hierarchies to use full key [puppet] - 10https://gerrit.wikimedia.org/r/315200 (https://phabricator.wikimedia.org/T147403) [10:43:00] (03CR) 10Mobrovac: [C: 031] "LGTM, and PCC is happy too - https://puppet-compiler.wmflabs.org/4269/" [puppet] - 10https://gerrit.wikimedia.org/r/315122 (owner: 10Elukey) [10:43:06] (03CR) 10Giuseppe Lavagetto: [V: 032] hiera: convert expand_path hierarchies to use full key [puppet] - 10https://gerrit.wikimedia.org/r/315200 (https://phabricator.wikimedia.org/T147403) (owner: 10Giuseppe Lavagetto) [10:43:23] PROBLEM - puppet last run on dbproxy1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:45:15] (03CR) 10Elukey: [C: 032] Add documentation to systemd::syslog and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 (owner: 10Elukey) [10:45:20] (03PS8) 10Elukey: Add documentation to systemd::syslog and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 [10:47:30] mobrovac: merged --^ [10:47:44] cool [10:47:51] thnx elukey for settings the docs up [10:48:23] mobrovac: thanks for the puppet class :) [10:50:26] (03PS1) 10Jcrespo: [WIP] mariadb:Create a systemd unit to be used with our new package [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/315228 [10:50:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [10:57:48] doing a dummy sync from mira [11:00:09] !log hashar@mira Synchronized README: testing deploy from mira (duration: 02m 38s) [11:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:04:05] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:04:11] <_joe_> hashar: it would be more interesting to do a scap pull from say mw1017 [11:04:23] <_joe_> and see changed files, if there is a way to do that [11:04:36] <_joe_> and verify the site works correctly after the pull :) [11:07:24] RECOVERY - puppet last run on dbproxy1007 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [11:09:34] (03PS2) 10Giuseppe Lavagetto: hiera: complete transition in nuyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/315202 (https://phabricator.wikimedia.org/T147403) [11:11:30] (03PS2) 10Jcrespo: mariadb:Create a systemd unit to be used with our new package [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/315228 [11:11:36] !log upgrading nodejs on scb1001 to 4.6.0 [11:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:13:50] (03PS3) 10Jcrespo: mariadb:Create a systemd unit to be used with our new package [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/315228 [11:14:48] !log depooling scb1001 for service restarts [11:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:15:08] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [11:18:19] !log reimaging mw1162.eqiad.wmnet to Debian (MW Jobrunner) [11:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:18:39] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2705246 (10Krd) 05Resolved>03Open [11:19:11] This will be fun [11:19:29] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2691741 (10Krd) There appear to be some database issues with ticket notification settings that need further investigation. Can you please check that? 
[11:19:35] akosiaris: ^^ [11:20:17] !log stopping and starting mysql on labsdb1008 (not active) for new package/config testing [11:20:26] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [11:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:21:07] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315232 (https://phabricator.wikimedia.org/T128546) [11:24:15] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:28:04] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:28:54] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Convert hiera to the form expected by the new backend [labs/private] - 10https://gerrit.wikimedia.org/r/315204 (https://phabricator.wikimedia.org/T147403) (owner: 10Giuseppe Lavagetto) [11:30:09] ignore the mobileapps stuff ^^^ [11:31:44] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:33:15] known ^^^ [11:33:20] (03PS3) 10Giuseppe Lavagetto: hiera: complete transition in nuyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/315202 (https://phabricator.wikimedia.org/T147403) [11:34:47] ACKNOWLEDGEMENT - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Marko Obrovac investigating master crash [11:35:25] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [11:35:39] (03PS1) 10Giuseppe Lavagetto: Fix 
labtest-instances... [labs/private] - 10https://gerrit.wikimedia.org/r/315233 [11:37:07] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Fix labtest-instances... [labs/private] - 10https://gerrit.wikimedia.org/r/315233 (owner: 10Giuseppe Lavagetto) [11:37:23] !log repooling scb1001 [11:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:42] !log decomissioning the old AQS cluster - aqs100[123] for good https://gerrit.wikimedia.org/r/#/c/314542/ [11:38:43] (03CR) 10Jcrespo: [C: 04-1] "There is something I am doing wrong- the socket is not create properly, plus mysql starts correctly, but systemd timouts without detecting" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/315228 (owner: 10Jcrespo) [11:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:40:29] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development: Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169#2705286 (10hashar) @volans maybe drive this again? I have voted for my preference, but really one way or the other is all fine... [11:40:53] (03PS6) 10Gilles: Make thumbor use a temp folder controlled by systemd-tmpfiles instead of /tmp [puppet] - 10https://gerrit.wikimedia.org/r/315062 [11:41:03] (03PS4) 10Giuseppe Lavagetto: hiera: complete transition in nuyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/315202 (https://phabricator.wikimedia.org/T147403) [11:42:16] (03PS1) 10Gilles: Point to a folder firejailed thumbor can actually write to [puppet] - 10https://gerrit.wikimedia.org/r/315234 [11:42:33] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2705302 (10akosiaris) Yeah, seems like the upgrade script is creating those. It is a byproduct of the migration process. I 'll remove the duplicate ones. 
[11:43:04] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [11:43:54] (03PS2) 10Elukey: Decommission the old AQS cluster [puppet] - 10https://gerrit.wikimedia.org/r/314542 (https://phabricator.wikimedia.org/T147461) [11:44:42] (03PS1) 10BBlack: remove role::cache::2layer [puppet] - 10https://gerrit.wikimedia.org/r/315236 [11:44:44] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2705305 (10akosiaris) I 've removed the ones marked Duplicate. Is the issue still present ? [11:46:37] (03CR) 10Elukey: [C: 032] Decommission the old AQS cluster [puppet] - 10https://gerrit.wikimedia.org/r/314542 (https://phabricator.wikimedia.org/T147461) (owner: 10Elukey) [11:46:45] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2705307 (10Krd) Looks good for me, but I'm not sure yet of the old values are ok for all users. [11:47:06] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2705308 (10akosiaris) So, I re-ran the migration scripts because of ``` If you upgrade from OTRS 5 Patch Level 2 or earlier, please run scripts/DBUpdate-to-5.pl once during the upgrade to fix possible issues with dysfunct... [11:52:51] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2705326 (10akosiaris) >>! In T147397#2705307, @Krd wrote: > Looks good for me, but I'm not sure yet of the old values are ok for all users. Those are not touched by the upgrade script as it seems. They have timestamps bac... [11:54:45] (03PS2) 10BBlack: remove role::cache::2layer [puppet] - 10https://gerrit.wikimedia.org/r/315236 [11:55:51] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2705332 (10pajz) Looks good. Looking at the histories of the most recent tickets, only one notification is sent per agent (as it should be), and the duplicates are also gone from the user settings panel. 
At least in my cas... [11:58:29] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2705335 (10akosiaris) For what is worth, my notifications have remained unaffected. [11:58:37] (03CR) 10BBlack: [C: 032] "NOOP on 8x hosts (direct and non-direct from each of the 4x clusters): https://puppet-compiler.wmflabs.org/4278/" [puppet] - 10https://gerrit.wikimedia.org/r/315236 (owner: 10BBlack) [11:58:44] (03PS3) 10BBlack: remove role::cache::2layer [puppet] - 10https://gerrit.wikimedia.org/r/315236 [11:58:48] (03CR) 10BBlack: [V: 032] remove role::cache::2layer [puppet] - 10https://gerrit.wikimedia.org/r/315236 (owner: 10BBlack) [12:01:17] (03PS2) 10Alexandros Kosiaris: Revert "Revert "icinga: Kill hostextinfo"" [puppet] - 10https://gerrit.wikimedia.org/r/315217 [12:01:56] (03PS3) 10Alexandros Kosiaris: Revert "Revert "icinga: Kill hostextinfo"" [puppet] - 10https://gerrit.wikimedia.org/r/315217 [12:01:59] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "Revert "icinga: Kill hostextinfo"" [puppet] - 10https://gerrit.wikimedia.org/r/315217 (owner: 10Alexandros Kosiaris) [12:13:49] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:14:48] !log upgrading nodejs on scb2001 to 4.6.0 [12:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:18:42] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [12:18:47] (03PS1) 10Alexandros Kosiaris: icinga: Kill /etc/icinga/puppet_hostextinfo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315242 [12:18:49] (03PS1) 10Alexandros Kosiaris: naggen2: Kill hostextinfo support [puppet] - 10https://gerrit.wikimedia.org/r/315243 [12:18:51] (03PS1) 10Alexandros Kosiaris: Remove absented /etc/icinga/puppet_hostextinfo.cfg entry [puppet] - 10https://gerrit.wikimedia.org/r/315244 [12:18:53] (03PS1) 10Alexandros Kosiaris: icinga: Remove the last vestiges of hostextinfo [puppet] - 10https://gerrit.wikimedia.org/r/315245 [12:18:55] (03PS1) 10Giuseppe Lavagetto: hiera: convert eqiad as well [puppet] - 10https://gerrit.wikimedia.org/r/315246 [12:19:18] (03CR) 10jenkins-bot: [V: 04-1] icinga: Kill /etc/icinga/puppet_hostextinfo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315242 (owner: 10Alexandros Kosiaris) [12:19:49] (03CR) 10jenkins-bot: [V: 04-1] naggen2: Kill hostextinfo support [puppet] - 10https://gerrit.wikimedia.org/r/315243 (owner: 10Alexandros Kosiaris) [12:20:26] (03CR) 10jenkins-bot: [V: 04-1] Remove absented /etc/icinga/puppet_hostextinfo.cfg entry [puppet] - 10https://gerrit.wikimedia.org/r/315244 (owner: 10Alexandros Kosiaris) [12:21:07] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [12:21:08] (03CR) 10jenkins-bot: [V: 04-1] icinga: Remove the last vestiges of hostextinfo [puppet] - 10https://gerrit.wikimedia.org/r/315245 (owner: 10Alexandros Kosiaris) [12:24:01] (03PS2) 10Ema: WIP: Text VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/314716 (https://phabricator.wikimedia.org/T131503) [12:24:02] PROBLEM - cassandra CQL 10.64.0.123:9042 on aqs1001 is CRITICAL: Connection refused [12:24:39] PROBLEM - cassandra service on aqs1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [12:27:17] !log change-prop 
scb1001: disabled puppet to try and debug why change-prop master is failing on node v4.6.0 [12:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:27:31] ah snap puppet on neon [12:27:34] I forgot it [12:27:39] the aqs alerts are mine [12:27:54] (03PS1) 10Gilles: Add memory limit to Thumbor subprocesses [puppet] - 10https://gerrit.wikimedia.org/r/315248 [12:28:53] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, please mention the systemd-tmpfiles cleanup in the commit message tho" [puppet] - 10https://gerrit.wikimedia.org/r/315062 (owner: 10Gilles) [12:29:35] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, will merge once dependent change is" [puppet] - 10https://gerrit.wikimedia.org/r/315234 (owner: 10Gilles) [12:30:23] !log rearming the keyholder on mira [12:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:30:44] (03PS7) 10Gilles: Make thumbor use a temp folder controlled by systemd-tmpfiles instead of /tmp [puppet] - 10https://gerrit.wikimedia.org/r/315062 [12:31:28] PROBLEM - cassandra CQL 10.64.32.175:9042 on aqs1002 is CRITICAL: Connection refused [12:32:17] PROBLEM - cassandra service on aqs1002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [12:33:18] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:33:27] PROBLEM - cassandra CQL 10.64.48.117:9042 on aqs1003 is CRITICAL: Connection refused [12:34:14] (03CR) 10Jcrespo: "the private tmp dolved the socket issue. But "mariadb.service start operation timed out", still." 
[puppet/mariadb] - 10https://gerrit.wikimedia.org/r/315228 (owner: 10Jcrespo) [12:34:25] sigh [12:34:27] sorry for the spam [12:34:28] 06Operations, 10Traffic: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2705399 (10BBlack) [12:35:47] 06Operations, 10Traffic: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2705413 (10BBlack) [12:35:49] (03PS4) 10Jcrespo: mariadb:Create a systemd unit to be used with our new package [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/315228 [12:35:49] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 3 others: Move wdqs to an LVS service - https://phabricator.wikimedia.org/T132457#2705415 (10BBlack) [12:37:03] 06Operations, 10Traffic: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2705418 (10BBlack) [12:37:04] ok no pending alarms for aqs100[123], I think they fired before puppet completed on neon [12:38:09] 06Operations, 10Traffic: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705444 (10BBlack) [12:38:59] (03PS1) 10DCausse: Elastic@deployment-prep: Remove deployment-elastic08 from the clsuter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315249 (https://phabricator.wikimedia.org/T147777) [12:39:34] !log nodejs reverted to 4.4.6 on scb1001, depooling for service restarts [12:39:38] 06Operations, 10Traffic: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2705399 (10BBlack) p:05Triage>03Normal [12:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:38] moritzm: first jobrunner completed, all good [12:40:50] great [12:41:37] (03CR) 10Marostegui: mariadb:Create a systemd unit to be used with our new package (031 comment) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/315228 (owner: 10Jcrespo) [12:42:48] ^wait, wait, I have yet to make it work [12:42:51] (03PS1) 10DCausse: 
[cirrus] remove cirrus BM25 A/B config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315250 [12:43:43] every version has slightly different start options [12:43:56] I have yet to understand the right ones for most versions [12:44:08] !log restarted keyholder-proxy on mira [12:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:44:37] jynus: Sure, I was just curious about it as in I assume it would be restarted on OOM but just checking :) [12:45:19] is not like we are going to blindly replace the systemd unit of all servers [12:45:25] I know [12:46:23] in fact, I am not sure the current version can run with systemd (except on compatibility mode) [12:46:29] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:36] because it lacks the running options [12:47:10] 06Operations, 10Traffic: Use hostnames (not IPs) in deployment-prep varnish app_directors - https://phabricator.wikimedia.org/T147848#2705466 (10BBlack) [12:47:28] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=ptwiki --logwiki=metawiki "Zhyar Merlin" "Zhiar Merlin" [12:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:45] ! 
on terbium ^ [12:47:51] !log on terbium ^ [12:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:48:20] I think we need to either patch the current version, run a higher version, run mysqld_safe or continue using compatibility mode until we upgrade [12:48:55] all of those are bad options [12:50:30] which would explain why mariadb uses init.d on jessie [12:50:46] (both the distro and upstream package) [12:51:49] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:52:36] (03PS2) 10DCausse: [cirrus] remove cirrus BM25 A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315250 (https://phabricator.wikimedia.org/T147508) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161011T1300). [13:00:05] jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:05:41] o/ [13:05:51] ahhhh [13:05:59] jan_drewniak_: wanna deploy that portal change ?
[13:06:08] not sure whether you get all the deployment prerequisities though [13:07:44] hashar: hey there, I don't know what's involved in becoming a 'deployer' so I'm not sure about that [13:08:32] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315232 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:08:52] jan_drewniak_: i cant remember where is the precheck list :( [13:08:58] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315232 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:09:07] then there is a deploy once in a while, so it is not a big deal :] [13:10:17] jan_drewniak_: should be on mw1099 now [13:10:35] new color!! :) [13:11:01] I know right :P [13:11:18] hashar: mw1099 looks good to me [13:11:54] 06Operations, 10ChangeProp, 06Services, 15User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2705503 (10mobrovac) [13:12:03] running the script [13:12:13] I am surprised the portal still use blue [13:12:30] isn't the Design/UX trend to just black gray and white nowadays? [13:13:41] !log hashar@mira Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 01m 46s) [13:13:41] sync almost done [13:13:46] phase2 [13:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:42] !log hashar@mira Synchronized portals: (no message) (duration: 01m 01s) [13:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:48] oh my f**** g*** [13:14:54] ? [13:15:02] jan_drewniak_: so that is all deployed [13:15:13] hashar: yay! [13:15:13] but the urls have not been purged [13:15:14] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:15:20] lmfao [13:15:25] cause on mira, running mwscript purgeList.php yields: [13:15:27] PHP Fatal error: Class 'Memcached' not found in /srv/mediawiki-staging/php-1.28.0-wmf.21/includes/libs/objectcache/MemcachedPeclBagOStuff.php on line 63 [13:15:27] Fatal error: Class 'Memcached' not found in /srv/mediawiki-staging/php-1.28.0-wmf.21/includes/libs/objectcache/MemcachedPeclBagOStuff.php on line 63 [13:15:30] known issue [13:15:45] moritzm: got one bad hit on mira. It lacks the Zend PHP5 packages :( [13:15:55] moritzm: but mwscript is still hardcoded to use 'php5' [13:16:12] pretty sure we got rid of Zend packages for Jesse intentionally [13:16:40] hashar want me to request on phab that we un hardcode php 5 on mwscript?? [13:16:48] (03PS5) 10Alexandros Kosiaris: icinga: Remove event_profiling_enabled [puppet] - 10https://gerrit.wikimedia.org/r/315085 [13:16:50] (03PS5) 10Alexandros Kosiaris: icinga: retry_check_interval => retry_interval [puppet] - 10https://gerrit.wikimedia.org/r/315087 [13:16:53] (03PS5) 10Alexandros Kosiaris: icinga: normal_check_interval => check_interval [puppet] - 10https://gerrit.wikimedia.org/r/315086 [13:16:55] (03PS2) 10Alexandros Kosiaris: icinga: Kill /etc/icinga/puppet_hostextinfo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315242 [13:16:56] (03PS2) 10Alexandros Kosiaris: naggen2: Kill hostextinfo support [puppet] - 10https://gerrit.wikimedia.org/r/315243 [13:16:58] (03PS2) 10Alexandros Kosiaris: Remove absented /etc/icinga/puppet_hostextinfo.cfg entry [puppet] - 10https://gerrit.wikimedia.org/r/315244 [13:17:01] (03PS2) 10Alexandros Kosiaris: icinga: Remove the last vestiges of hostextinfo [puppet] - 10https://gerrit.wikimedia.org/r/315245 [13:17:02] (03PS1) 10Alexandros Kosiaris: site.pp: Minor comment fixes [puppet] - 10https://gerrit.wikimedia.org/r/315252 [13:17:03] Zppix: yeah there is a ticket from back February iirc [13:17:05] (03PS1) 10Alexandros Kosiaris: 
monitoring_hosts: Add tegmen/einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/315253 [13:17:06] (03PS1) 10Alexandros Kosiaris: ganglia: Remove neon as a gmetad allowed host [puppet] - 10https://gerrit.wikimedia.org/r/315254 [13:17:09] (03PS1) 10Alexandros Kosiaris: ntp: Update neon specific ACLs to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/315255 [13:17:11] (03PS1) 10Alexandros Kosiaris: role::mariadb: Remove neon ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/315256 [13:17:12] (03PS1) 10Alexandros Kosiaris: Replace neon with einsteinium where applicable [puppet] - 10https://gerrit.wikimedia.org/r/315257 [13:17:18] ah [13:17:19] ^^ HOLY SH!! [13:17:22] https://gerrit.wikimedia.org/r/#/c/313305/ would got PHP5 back [13:17:29] (03CR) 10jenkins-bot: [V: 04-1] icinga: Remove event_profiling_enabled [puppet] - 10https://gerrit.wikimedia.org/r/315085 (owner: 10Alexandros Kosiaris) [13:17:42] akosiaris: I am afraid they are all going to fail :D [13:18:07] let them fail [13:18:12] (03CR) 10jenkins-bot: [V: 04-1] icinga: retry_check_interval => retry_interval [puppet] - 10https://gerrit.wikimedia.org/r/315087 (owner: 10Alexandros Kosiaris) [13:18:22] hashar: ah, indeed. having a look at the patch [13:18:37] (03CR) 10Zppix: [C: 031] Bring back Zend PHP on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/313305 (https://phabricator.wikimedia.org/T146286) (owner: 10Hashar) [13:18:46] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, 13Patch-For-Review: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2705534 (10hashar) So we have switched today the primary deployment server to mira.codfw.wmnet which is running Jessie. The European SWA... 
[13:18:50] (03CR) 10jenkins-bot: [V: 04-1] icinga: normal_check_interval => check_interval [puppet] - 10https://gerrit.wikimedia.org/r/315086 (owner: 10Alexandros Kosiaris) [13:19:22] moritzm: and rest of wall of text is on https://phabricator.wikimedia.org/T146286 I guess [13:19:39] (03CR) 10jenkins-bot: [V: 04-1] icinga: Kill /etc/icinga/puppet_hostextinfo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315242 (owner: 10Alexandros Kosiaris) [13:20:04] jan_drewniak_: I will purge the URLs later on [13:20:06] hashar: looks fine, running that through PCC and then I'll merge [13:20:35] (03CR) 10jenkins-bot: [V: 04-1] naggen2: Kill hostextinfo support [puppet] - 10https://gerrit.wikimedia.org/r/315243 (owner: 10Alexandros Kosiaris) [13:20:46] moritzm: and I think we will need the same when we move terbium/wassat to Jessie [13:20:53] but on whatever puppet class they are using [13:20:57] ack, will prepare a similar patch for those [13:21:03] unless of course, we get mwscript to no more use php5 [13:21:20] I am pretty sure hhvm is all fine now. It got hardcoded to php5 ages ago to workaround some crazy issue [13:21:26] and I guess it is no more needed [13:21:29] hashar: that's fine [13:21:36] (03CR) 10jenkins-bot: [V: 04-1] Remove absented /etc/icinga/puppet_hostextinfo.cfg entry [puppet] - 10https://gerrit.wikimedia.org/r/315244 (owner: 10Alexandros Kosiaris) [13:21:36] brewing a coffee [13:21:38] hashar: btw, who is the culprit for those fails ? zuul ? [13:21:51] or some subcomponent ? [13:21:57] jessie [13:22:05] akosiaris: the process 'zuul-merger' which pick your patch and tries to merge it on tip of origin/production branch [13:22:06] james ? 
[13:22:07] :P [13:22:28] which fails somehow horribly I assume [13:22:33] yeah [13:22:37] hmm [13:22:38] ok [13:22:39] thanks [13:22:43] (03CR) 10jenkins-bot: [V: 04-1] icinga: Remove the last vestiges of hostextinfo [puppet] - 10https://gerrit.wikimedia.org/r/315245 (owner: 10Alexandros Kosiaris) [13:22:47] (03PS2) 10Muehlenhoff: Bring back Zend PHP on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/313305 (https://phabricator.wikimedia.org/T146286) (owner: 10Hashar) [13:22:48] akosiaris: if you are REALLY curious, you can look on scandium.eqiad.wmnet in /var/log/zuul/merger-debug.log :] [13:22:50] (03CR) 10jenkins-bot: [V: 04-1] site.pp: Minor comment fixes [puppet] - 10https://gerrit.wikimedia.org/r/315252 (owner: 10Alexandros Kosiaris) [13:22:53] akosiaris it causes a nuclear missle launch process :P [13:23:01] it owuld have the exact output of the 'git merge -s resolve' [13:23:02] (03CR) 10jenkins-bot: [V: 04-1] monitoring_hosts: Add tegmen/einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/315253 (owner: 10Alexandros Kosiaris) [13:23:13] need a coffee brb [13:23:20] (03CR) 10jenkins-bot: [V: 04-1] ganglia: Remove neon as a gmetad allowed host [puppet] - 10https://gerrit.wikimedia.org/r/315254 (owner: 10Alexandros Kosiaris) [13:23:45] (03CR) 10jenkins-bot: [V: 04-1] ntp: Update neon specific ACLs to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/315255 (owner: 10Alexandros Kosiaris) [13:24:29] Zppix: then we need dolf ludgren I guess.. 
[13:24:33] https://en.wikipedia.org/wiki/The_Peacekeeper [13:24:59] (03CR) 10jenkins-bot: [V: 04-1] role::mariadb: Remove neon ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/315256 (owner: 10Alexandros Kosiaris) [13:25:04] dolph* [13:25:07] and lmfao [13:25:11] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2705543 (10Ottomata) > Also, having modules/profiles is what everyone is doing. Ah interesting, I thought we were making this stuff up. Just read [[ ht... [13:25:35] yeah ludgren was incorrect too.. it's Lundgren [13:25:54] but I am happy the joke was good [13:26:20] (03CR) 10jenkins-bot: [V: 04-1] Replace neon with einsteinium where applicable [puppet] - 10https://gerrit.wikimedia.org/r/315257 (owner: 10Alexandros Kosiaris) [13:26:22] (03CR) 10Muehlenhoff: [C: 032 V: 032] Bring back Zend PHP on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/313305 (https://phabricator.wikimedia.org/T146286) (owner: 10Hashar) [13:26:54] i just ate breakfeast but im freaking starving rn [13:27:15] good thing my desk has a box of poptarts :P [13:27:24] (03PS2) 10Alexandros Kosiaris: site.pp: Minor comment fixes [puppet] - 10https://gerrit.wikimedia.org/r/315252 [13:28:39] (03CR) 10Alexandros Kosiaris: [C: 032] site.pp: Minor comment fixes [puppet] - 10https://gerrit.wikimedia.org/r/315252 (owner: 10Alexandros Kosiaris) [13:28:55] (03PS2) 10Alexandros Kosiaris: monitoring_hosts: Add tegmen/einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/315253 [13:28:57] moritzm: would you mind force running puppet on mira ? [13:29:03] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:29:41] hashar: on it, first doublechecking on tin, then I'll run it on mira [13:29:56] Zppix: my desk just have a nice keyboard, some banana, a fresh coffee and my phone. Pretty minimalist :] [13:30:10] hashar you work at WMF? no? 
[13:30:24] (03CR) 10Alexandros Kosiaris: [C: 032] monitoring_hosts: Add tegmen/einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/315253 (owner: 10Alexandros Kosiaris) [13:30:27] moritzm: I had the patch cherry picked on the beta cluster deploy server. I guess that is where we have found out some php5 extensions were missing [13:30:38] Zppix: yeah I am a vendor for them [13:30:55] babysitting deployment as I can (though others here are much more knowledgeable than me on that topic) [13:30:56] hashar well im not im at home eating poptarts from my box thats in my desk xD [13:31:04] hashar: puppet running on mira [13:31:06] (03CR) 10Ottomata: [C: 031] Remove old wikistats cron script causing cron-spam [puppet] - 10https://gerrit.wikimedia.org/r/315101 (https://phabricator.wikimedia.org/T145606) (owner: 10Elukey) [13:31:12] Zppix: and dealing with the whole mess of continuous integration (eg Jenkins) [13:31:24] Jenkins i've heard is quite the b!tch [13:31:35] GitCommandError: 'git merge -s resolve FETCH_HEAD' returned with exit code 1 [13:31:35] stderr: 'error: Merge requires file-level merging [13:31:35] ERROR: content conflict in modules/monitoring/manifests/group.pp [13:31:38] heh [13:31:42] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:31:42] maybe that explains it [13:31:49] akosiaris: yeah the failkures this morning were related to the same file :( [13:31:55] true [13:32:01] gotta love puppets :P [13:32:12] akosiaris: havent really investigated about it, I gave up and asked you to rebase then they passed. 
No idea what is happening though :( [13:32:26] and for patches that are actually before the one that patches that file [13:32:29] moritzm: looks good now :) [13:32:29] weird [13:32:57] hashar: ack, all done [13:33:09] !log mira: purging portals URLs for jan_drewniak_ : cat /srv/mediawiki-staging/portals/urls-to-purge.txt | mwscript purgeList.php [13:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:21] jan_drewniak_: one url purged :] [13:33:27] LMFAO [13:34:01] my cat just jumped off my dresser and landed on a shelf and is now confused [13:34:06] (03PS4) 10Elukey: Remove old wikistats cron script causing cron-spam [puppet] - 10https://gerrit.wikimedia.org/r/315101 (https://phabricator.wikimedia.org/T145606) [13:34:36] (03PS2) 10Elukey: Add extra compiler warnings to the Makefile [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/314662 (https://phabricator.wikimedia.org/T147436) [13:35:27] let's try this once more [13:35:42] (03PS6) 10Alexandros Kosiaris: icinga: Remove event_profiling_enabled [puppet] - 10https://gerrit.wikimedia.org/r/315085 [13:35:44] (03PS6) 10Alexandros Kosiaris: icinga: retry_check_interval => retry_interval [puppet] - 10https://gerrit.wikimedia.org/r/315087 [13:35:46] (03PS2) 10Alexandros Kosiaris: ntp: Update neon specific ACLs to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/315255 [13:35:48] (03PS6) 10Alexandros Kosiaris: icinga: normal_check_interval => check_interval [puppet] - 10https://gerrit.wikimedia.org/r/315086 [13:35:50] (03PS2) 10Alexandros Kosiaris: ganglia: Remove neon as a gmetad allowed host [puppet] - 10https://gerrit.wikimedia.org/r/315254 [13:35:52] (03PS2) 10Alexandros Kosiaris: Replace neon with einsteinium where applicable [puppet] - 10https://gerrit.wikimedia.org/r/315257 [13:35:54] (03PS2) 10Alexandros Kosiaris: role::mariadb: Remove neon ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/315256 [13:35:56] (03PS3) 10Alexandros Kosiaris: 
icinga: Kill /etc/icinga/puppet_hostextinfo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315242 [13:35:58] (03PS3) 10Alexandros Kosiaris: naggen2: Kill hostextinfo support [puppet] - 10https://gerrit.wikimedia.org/r/315243 [13:36:00] (03PS3) 10Alexandros Kosiaris: Remove absented /etc/icinga/puppet_hostextinfo.cfg entry [puppet] - 10https://gerrit.wikimedia.org/r/315244 [13:36:02] (03PS3) 10Alexandros Kosiaris: icinga: Remove the last vestiges of hostextinfo [puppet] - 10https://gerrit.wikimedia.org/r/315245 [13:36:03] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, 13Patch-For-Review: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2705567 (10hashar) Moritz has run puppet on mira.codfw.wmnet (Jessie) and that fixed the issue above. We will most probably want to a si... [13:36:07] moritzm: solved thank you! [13:36:16] moritzm: so I guess we have more or less validated deployment from mira [13:36:23] first with analytics that managed to scap deploy something [13:36:29] now that portal change + php5 modules [13:36:45] still have to try out Trebuchet :( [13:36:54] (03CR) 10Elukey: [C: 032] Remove old wikistats cron script causing cron-spam [puppet] - 10https://gerrit.wikimedia.org/r/315101 (https://phabricator.wikimedia.org/T145606) (owner: 10Elukey) [13:38:06] Zppix: Jenkins really just run whatever one ask it to run :D [13:38:27] seems fine now [13:38:56] jan_drewniak_: the purge did not work for some reason :( [13:39:01] I still see the old blue [13:39:38] (03PS1) 10Muehlenhoff: Provide PHP packages for mwscript [puppet] - 10https://gerrit.wikimedia.org/r/315260 (https://phabricator.wikimedia.org/T146286) [13:39:41] trebuchet is the font that trump uses hashar and we all *love* him :P [13:40:11] in other words i hate trebuchet [13:41:06] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/4286/stat1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/315101 
(https://phabricator.wikimedia.org/T145606) (owner: 10Elukey) [13:41:48] jan_drewniak_: well I force reloaded and everything is fine :] [13:41:53] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:42:01] anyone know when wmf.22 gets deployed? [13:42:16] https://wikitech.wikimedia.org/wiki/Deployments [13:42:33] hashar: yeah was just gonna say it looks good to me! [13:42:42] Zppix: https://tools.wmflabs.org/versions/ for current deployed version, and bottom links have calendar and roadmap (which reedy pointed to ) [13:42:52] jan_drewniak_: I ran the purge ahother time just to be sure [13:46:35] ty [13:50:29] (03PS4) 10Eevans: Add time-window compaction strategy jar to classpath [puppet] - 10https://gerrit.wikimedia.org/r/314603 (https://phabricator.wikimedia.org/T133395) [13:50:41] (03CR) 10Eevans: [C: 031] "Ready to go." [puppet] - 10https://gerrit.wikimedia.org/r/314603 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [13:53:11] i have a core change that i would love to get merged before .22 goes out [13:54:42] 06Operations, 10ChangeProp, 06Services, 15User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2705590 (10mobrovac) [13:55:22] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:55:57] (03PS6) 10Muehlenhoff: Generate stats for monthly package upgrade activity [puppet] - 10https://gerrit.wikimedia.org/r/303531 (https://phabricator.wikimedia.org/T116742) [13:56:30] !log European SWAT is done. [13:56:31] can someone with mw/config access verify something for me? [13:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:05] Zppix: better hurry cause we are cutting wmf.22 in a few hours :] [13:57:26] ok, i need to know if rollback on ru.wiki is set up correctly [13:57:33] in config [13:58:16] is there even config for rollback? 
[13:58:19] IDK [13:58:38] and that type of config will be public at noc.wikimedia.org [13:58:52] oh really? [13:59:08] never knew that existed :P sorry if the above cmnt sounded sarcastic [13:59:47] ok i see no PUBLIC config for rollback [14:00:42] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:00:48] (03PS2) 10Rush: labsdb: puppetize maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/314773 [14:01:02] I think ru wiki is having some cookie/session hist issues ( see T147756 ) [14:01:03] T147756: Rollback on ru.wp often fails with "problem with your login session"; reloading and trying again doesn't help - https://phabricator.wikimedia.org/T147756 [14:01:17] (03CR) 10Muehlenhoff: [C: 032] Generate stats for monthly package upgrade activity [puppet] - 10https://gerrit.wikimedia.org/r/303531 (https://phabricator.wikimedia.org/T116742) (owner: 10Muehlenhoff) [14:01:43] (03CR) 10jenkins-bot: [V: 04-1] labsdb: puppetize maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/314773 (owner: 10Rush) [14:05:20] (03PS3) 10Rush: labsdb: puppetize maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/314773 [14:09:15] (03PS4) 10Rush: labsdb: puppetize maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/314773 [14:10:20] (03CR) 10Elukey: [C: 032] "Had a chat with Eric on IRC:" [puppet] - 10https://gerrit.wikimedia.org/r/314603 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [14:10:26] (03PS5) 10Elukey: Add time-window compaction strategy jar to classpath [puppet] - 10https://gerrit.wikimedia.org/r/314603 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [14:12:32] urandom: merged --^ [14:12:44] elukey: thank you sirR! 
[14:12:50] s/R// [14:12:53] :) [14:13:38] (03PS5) 10Rush: labsdb: puppetize maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/314773 [14:14:05] can someone please merge https://gerrit.wikimedia.org/r/302063 [14:14:57] s/merge/review/ [14:19:26] completed anomie_ [14:20:01] !log T133395: Restarting xenon.eqiad.wmnet to apply https://gerrit.wikimedia.org/r/314603 [14:20:03] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [14:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:00] (03PS1) 10Gehel: Maps - cleanup postgres user creation [puppet] - 10https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) [14:24:48] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2705670 (10chasemp) note: we should update this contact info https://wikitech.wikimedia.org/wiki/Add_a_wiki#Start [14:25:06] (03CR) 10jenkins-bot: [V: 04-1] Maps - cleanup postgres user creation [puppet] - 10https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) (owner: 10Gehel) [14:25:44] Zppix: Your session/cookie thing makes me suspect a problem with the redis storage for session data. [14:25:52] (03PS1) 10Gilles: Add mtail program to track thumbor OOM kills [puppet] - 10https://gerrit.wikimedia.org/r/315272 [14:27:17] 06Operations, 10Traffic: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705444 (10mark) Well... LVS itself relies on pybal_config of course... 
:) [14:27:58] !log upgraded zuul on scandium (T147073) [14:27:59] T147073: Upgrade Zuul on scandium to 2.5.0-8-gcbc7f62-wmf3jessie1 - https://phabricator.wikimedia.org/T147073 [14:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:03] 06Operations, 10Continuous-Integration-Infrastructure, 07Zuul: Upgrade Zuul on scandium to 2.5.0-8-gcbc7f62-wmf3jessie1 - https://phabricator.wikimedia.org/T147073#2705708 (10hashar) 05Open>03Resolved a:03hashar Solved by @elukey but that does not fix it :( I have screwed up the package and the shebang... [14:31:45] nice one halfak [14:31:50] sorry hashar [14:35:35] (03PS6) 10Rush: labsdb: puppetize maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/314773 [14:42:00] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2705758 (10Cmjohnson) HP Support Case ID: 5314079417 [14:47:15] (03CR) 10Muehlenhoff: "NOP on the current trusty-based systems: http://puppet-compiler.wmflabs.org/4293/" [puppet] - 10https://gerrit.wikimedia.org/r/315260 (https://phabricator.wikimedia.org/T146286) (owner: 10Muehlenhoff) [14:49:28] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:52:07] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:53:14] cmjohnson1: hi! Just wanted to ask you if you'd have time to check https://phabricator.wikimedia.org/T147707 this week [14:55:22] 06Operations, 10Traffic: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705784 (10BBlack) I don't think it uses it directly. LVS/pybal talks directly to etcd, whereas this is just an HTTP view of the same data. 
[14:56:39] 06Operations, 10Traffic: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705785 (10BBlack) (well, beyond that, I think it's exposing the old text files even, not the new etcd data?). [14:57:46] !log Upgrading Zuul on gallium 2.5.0-8-gcbc7f62-wmf2precise1 2.5.0-8-gcbc7f62-wmf3precise1 (merely a noop for zuul scheduler) [14:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:57] 06Operations, 10Traffic: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705444 (10Joe) @bblack you are correct saying it isn't used by pybal anymore, but incorrect when saying it's exposing the old text files: the current files are generated from etcd. [14:58:03] !log Upgrading Zuul on gallium 2.5.0-8-gcbc7f62-wmf2precise1 2.5.0-8-gcbc7f62-wmf3precise1 (merely a noop for zuul scheduler) T147070 [14:58:04] T147070: Upgrade Zuul on gallium to 2.5.0-8-gcbc7f62-wmf3precise1 - https://phabricator.wikimedia.org/T147070 [14:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:58:55] elukey: yes, I will get to it today [14:59:12] woooo [14:59:14] thank you!!! [14:59:23] 06Operations, 10Traffic: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705793 (10BBlack) ah ok, the format threw me off, it makes sense now :) [14:59:32] (03PS2) 10Giuseppe Lavagetto: hiera: convert eqiad as well [puppet] - 10https://gerrit.wikimedia.org/r/315246 [15:00:19] anome i fixed what you commented on, https://gerrit.wikimedia.org/r/302063 [15:03:02] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.24 [debs/linux44] - 10https://gerrit.wikimedia.org/r/315215 (owner: 10Muehlenhoff) [15:04:47] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2705801 (10MoritzMuehlenhoff) mira is now the primary deployment server. 
tin will be reimaged to jessie on the 18th. After that we can switch bac... [15:04:52] (03PS1) 10Jforrester: Enable the visual editor for logged-in users on remaining phase 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315277 (https://phabricator.wikimedia.org/T142589) [15:04:54] (03PS1) 10Jforrester: Enable the visual editor for all users on remaining phase 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315278 (https://phabricator.wikimedia.org/T142589) [15:07:05] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: convert eqiad as well [puppet] - 10https://gerrit.wikimedia.org/r/315246 (owner: 10Giuseppe Lavagetto) [15:14:07] (03CR) 10Jforrester: [C: 04-1] "Not just yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315278 (https://phabricator.wikimedia.org/T142589) (owner: 10Jforrester) [15:14:55] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2705840 (10RobH) @joe: The new leased servers for this are on site, so if these temp hosts haven't been used yet, you may just wan... 
[15:15:18] 06Operations, 10Ops-Access-Requests: Requesting access to terbium for smalyshev (restricted group) - https://phabricator.wikimedia.org/T147666#2705841 (10RobH) [15:16:13] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2705847 (10Joe) @robh I was about to use them tomorrow, at least to test the installation of docker [15:16:53] (03PS5) 10Giuseppe Lavagetto: hiera: complete transition in nuyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/315202 (https://phabricator.wikimedia.org/T147403) [15:17:51] (03PS2) 10Jforrester: Enable the visual editor for logged-in users on remaining phase 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315277 (https://phabricator.wikimedia.org/T142589) [15:18:54] (03CR) 10Alex Monk: "Commit message misses arcwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315278 (https://phabricator.wikimedia.org/T142589) (owner: 10Jforrester) [15:20:09] (03PS2) 10Jforrester: Enable the visual editor for all users on remaining phase 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315278 (https://phabricator.wikimedia.org/T142589) [15:20:23] (03CR) 10Jforrester: "Thanks for the spot." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315278 (https://phabricator.wikimedia.org/T142589) (owner: 10Jforrester) [15:20:27] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: complete transition in nuyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/315202 (https://phabricator.wikimedia.org/T147403) (owner: 10Giuseppe Lavagetto) [15:25:12] 06Operations, 07Puppet, 13Patch-For-Review, 15User-Joe: Change behaviour of expand_path in hiera lookups. - https://phabricator.wikimedia.org/T147403#2705875 (10Joe) 05Open>03Resolved [15:28:49] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:31:36] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2705900 (10Marostegui) We have added olowiki filtering. We executed the following command ``` root@neodymium:/home/jynus/software/redactatron/scr... [15:34:39] (03CR) 10Faidon Liambotis: [C: 031] "Is the aborted inquiry related to the timeout? In any case, increasing the timeout sounds fine to me but there's only so much we can incre" [puppet] - 10https://gerrit.wikimedia.org/r/315103 (owner: 10Filippo Giunchedi) [15:43:45] (03CR) 10Faidon Liambotis: raid: tweak check_interval for forking checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315107 (owner: 10Filippo Giunchedi) [15:45:53] cmjohnson1: I've copied /home and the pbuilder result off copper now and downtimed, ready to go [15:46:08] okay..cool! Thx [15:48:13] _joe_: not on the other channel due to some IRC issues... how's the k8s packaging going? [15:52:06] <_joe_> yuvipanda: horribly [15:52:16] <_joe_> yuvipanda: but, I am working on docker::engine [15:52:26] haha nice [15:52:56] (03CR) 10Ori.livneh: [C: 031] "Looks good, but deploy with care since it requires an Apache configuration reload on all app servers." [puppet] - 10https://gerrit.wikimedia.org/r/314519 (owner: 10Elukey) [15:54:32] joe: the puppet class or the deb package? [15:55:09] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:04] <_joe_> yuvipanda: the puppet class [15:58:18] <_joe_> the package is very complex to build on jessie [15:58:31] <_joe_> I guess that's the reason why jessie-backports are lagging behind [15:58:31] _joe_: ah, ok. 
do poke me for review, to make sure we don't break tools :D [15:58:38] <_joe_> yuvipanda: of course [15:58:39] godog: disks swapped, it's in the installer now [15:58:55] _joe_: yeah, i'm not surprised at all (that it's hard to build) [15:59:07] I personally have no hope that that'll ever change for any go projects [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161011T1600). [16:00:05] Dereckson: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:43] cmjohnson1: thanks! I'll kick off a reimage with wmf-auto-reimage so cleanup is taken care of [16:01:33] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:02:58] 06Operations, 10Gerrit: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2705957 (10Dzahn) [16:03:00] 06Operations, 10Gerrit, 13Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2705956 (10Dzahn) [16:04:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Follow-up Ifa2cc187: Add ShortUrl support on wikimedia.org docroot sites (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/311647 (https://phabricator.wikimedia.org/T146014) (owner: 10Alex Monk) [16:06:56] 06Operations, 10ops-eqiad, 10DBA: Physically move db1053 to a different rack - https://phabricator.wikimedia.org/T147774#2703076 (10Cmjohnson) @Marostegui I can move this server to A2. Give me the go ahead once you have powered off and it's safe to move. [16:07:29] <_joe_> Dereckson: that patch is not ready for being published [16:08:21] (03PS1) 10Giuseppe Lavagetto: lvm: add module from puppetlabs. 
[puppet] - 10https://gerrit.wikimedia.org/r/315293 [16:08:23] (03PS1) 10Giuseppe Lavagetto: docker::engine: remove execs, transform to pure-puppet [puppet] - 10https://gerrit.wikimedia.org/r/315294 [16:08:24] 06Operations, 10ops-eqiad, 10DBA: Physically move db1053 to a different rack - https://phabricator.wikimedia.org/T147774#2705987 (10Marostegui) Thanks Chris - I have been told I have to update DNS with the new IP, is that something you can give me beforehand or it will just dhcp it? Also, given that tomorro... [16:08:30] <_joe_> yuvipanda: ^^ [16:08:44] <_joe_> yuvipanda: it's a bit of a WiP, but should give you an idea [16:09:25] (03CR) 10jenkins-bot: [V: 04-1] lvm: add module from puppetlabs. [puppet] - 10https://gerrit.wikimedia.org/r/315293 (owner: 10Giuseppe Lavagetto) [16:09:30] <_joe_> ahah [16:09:35] 06Operations, 10ops-eqiad, 10DBA: Physically move db1053 to a different rack - https://phabricator.wikimedia.org/T147774#2705989 (10Cmjohnson) @Marostegui I will fix the dns for you once it's moved to row A. [16:09:36] <_joe_> that's from puppetlabs, ofc [16:09:47] joe nice. do those defines already exist? [16:09:58] or are you going to have to write them? [16:10:11] 06Operations, 10ops-eqiad, 10DBA: Physically move db1053 to a different rack - https://phabricator.wikimedia.org/T147774#2705990 (10Marostegui) @Cmjohnson Excellent! Thanks!. So let's wait till Thursday then [16:10:25] joe aaah, I see the lvm module [16:10:27] nice [16:11:13] _joe_: Krenair asserted it as ready, and needed for the MediaWiki extension installation. What test would you like? [16:11:26] 06Operations, 10Gerrit: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2705992 (10Dzahn) Oh, it turns out it's not just permissions on the file, it's also that the DB part must have changed. We now see on both, old and new server, that: ``` {"type":"error","mess... 
[16:11:51] <_joe_> Dereckson: I commented on the patch [16:12:25] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2706000 (10jcrespo) a:05jcrespo>03chasemp Nominatively assigning to Chase, as the production part is done, but of course feel free to reassign... [16:12:41] 06Operations, 10ops-eqiad: system WMF3096 lacking details in racktables - https://phabricator.wikimedia.org/T145156#2706002 (10Cmjohnson) 05Open>03Resolved Resolving and creating a decom task [16:13:11] ah, seen your comments [16:15:39] _joe_: it follows the configuration in modules/mediawiki/files/apache/sites/main.conf [16:15:51] 06Operations, 10ops-eqiad: Decommission wmf3096 - https://phabricator.wikimedia.org/T147860#2706008 (10Cmjohnson) [16:15:56] <_joe_> Dereckson: well, that's wrong then :P [16:16:08] <_joe_> but let me take a closer look then [16:16:10] 06Operations, 10ops-eqiad: Decommission wmf3096 - https://phabricator.wikimedia.org/T147860#2706020 (10Cmjohnson) p:05Triage>03Low [16:16:39] 06Operations, 10ops-eqiad: Decommission strontium - https://phabricator.wikimedia.org/T142722#2706022 (10Cmjohnson) p:05Triage>03Normal [16:16:59] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2706024 (10Cmjohnson) p:05Normal>03High [16:17:13] (03PS1) 10Dzahn: gerrit: disable reviewer-counts cron job [puppet] - 10https://gerrit.wikimedia.org/r/315296 (https://phabricator.wikimedia.org/T147776) [16:17:47] 06Operations, 10ops-eqiad: Investigate strontium disk issues on 2016-08-05 - https://phabricator.wikimedia.org/T142187#2706028 (10Cmjohnson) p:05Triage>03Normal @akosiaris What do you want to do about this server? 
[16:18:23] <_joe_> Dereckson: actually we have -inconsistently- either the ProxyPass or the RewriteRule :( [16:18:52] 06Operations, 10Gerrit, 13Patch-For-Review: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2706030 (10Dzahn) Why the syntax error? We just moved gerrit, we did not upgrade the version. If anything i'd expect a permissions issue, but syntax ?? [16:20:16] (03PS1) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315297 (https://phabricator.wikimedia.org/T147508) [16:20:18] (03PS1) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315298 (https://phabricator.wikimedia.org/T147508) [16:20:20] (03PS1) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315299 (https://phabricator.wikimedia.org/T147508) [16:21:24] jynus hi, im wondering could you take a look at https://phabricator.wikimedia.org/T147776 please? 
[16:21:30] Since it is mariadb specific [16:21:35] {"type":"error","message":"You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ''SELECT changes.change_id AS change_id, COUNT(DISTINCT patch_set_approvals.accou' at line 1"} [16:22:03] (03CR) 10Dzahn: [C: 032] gerrit: disable reviewer-counts cron job [puppet] - 10https://gerrit.wikimedia.org/r/315296 (https://phabricator.wikimedia.org/T147776) (owner: 10Dzahn) [16:22:08] (03PS2) 10Dzahn: gerrit: disable reviewer-counts cron job [puppet] - 10https://gerrit.wikimedia.org/r/315296 (https://phabricator.wikimedia.org/T147776) [16:22:12] <_joe_> Dereckson: I can fix that patch and merge it tomorrow I guess [16:22:16] paladox, that is not a mysql problem- the query is wrong [16:22:24] Oh [16:22:27] mutante ^^ [16:22:28] "You have an error in your SQL syntax" [16:22:50] thanks [16:22:51] probably an extra quote [16:23:29] Oh [16:23:34] thanks [16:23:37] elukey: do you have a way of determining which disk is /dev/sdi? maybe ottomata remembers [16:24:19] /usr/bin/java -jar /var/lib/gerrit2/review_site/bin/gerrit.war gsql -d /var/lib/gerrit2/review_site/ --format JSON_SINGLE -c \'SELECT changes.change_id AS change_id, COUNT(DISTINCT patch_set_approvals.account_id) AS reviewer_count FROM changes LEFT JOIN patch_set_approvals ON (changes.change_id = patch_set_approvals.change_id) GROUP BY changes.change_id'\ > /var/www/reviewer-counts.json [16:24:24] So like that^^ [16:24:29] jynus ^^ [16:24:33] oh it's happening (the disk replacement), yay [16:24:34] Without the " [16:24:38] cmjohnson1: sdi is still mounted [16:24:43] on kafka1018, right? [16:24:47] i could write bytes to it [16:24:49] right [16:24:51] like we've done before [16:25:00] paladox: how about checking git log if anyone changed it? 
[16:25:01] oh [16:25:05] perhaps not [16:25:05] ls: reading directory .: Input/output error [16:25:09] Ok [16:25:14] Going to do that now [16:25:16] it's "mounted" but pretty broken yeah [16:25:33] I think I know which one it is but I don't want to screw anything up for you if I pull the wrong disk [16:25:34] _joe_: okay, perhaps also uniformize the /s for other sites? [16:25:47] cmjohnson1: kafka is down there [16:25:55] https://github.com/wikimedia/operations-puppet/commit/ae6f877c6a430e69f9e1aa8c9d1a1d6a79a5aaba [16:25:58] so this broker is not being used [16:25:58] LOL mutante ^^ [16:26:00] you can pull [16:26:03] (03CR) 10Yuvipanda: Move toollabs node classes to roles. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314180 (https://phabricator.wikimedia.org/T147233) (owner: 10Andrew Bogott) [16:26:06] okay, cool [16:26:07] It was changed when we were migrating to lead [16:26:48] megacli -CfgForeign -Clear -a0 [16:27:32] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:59] jynus: he was asking because that query is puppetized and didnt change recently [16:28:02] RECOVERY - MD RAID on copper is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:28:19] obviously syntax error, but kind of odd that it happens now [16:28:32] mutante, I do not even know where that is being queried from [16:28:36] or to [16:28:42] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed with no logs - https://phabricator.wikimedia.org/T147769#2702976 (10RobH) Seems the Dell tech is asking Papaul for hardware logs: Syslog shows nothing for the hard crash: Oct 10 03:29:34 es2015 puppet-agent[172665]: Retrieving pluginfac... 
[16:28:48] jynus it's for reviewer count [16:28:50] from gerrit [16:29:10] but a syntax error is a syntax error, unless someone has tweaked the db config [16:29:46] I am right now busy with hardware problems, unless this is creating an outage, it will have to wait [16:30:16] Found [16:30:17] it [16:30:21] jynus i found a fix [16:30:26] Tested on our test install [16:30:29] that matches gerrit [16:30:31] mutnate ^^ [16:30:41] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2706063 (10Cmjohnson) Replaced the disk and it's back to unconfigured good...you will need to make jbod. [16:30:45] /usr/bin/java -jar /var/lib/gerrit2/review_site/bin/gerrit.war gsql -d /var/lib/gerrit2/review_site/ --format JSON_SINGLE -c 'SELECT changes.change_id AS change_id, COUNT(DISTINCT patch_set_approvals.account_id) AS reviewer_count FROM changes LEFT JOIN patch_set_approvals ON (changes.change_id = patch_set_approvals.change_id) GROUP BY changes.change_id' > /var/www/reviewer-counts.json [16:30:47] Is the command [16:30:57] noticed i removed \ and " [16:31:00] i now get [16:31:07] [{"type":"row","columns":{"change_id":"1","reviewer_count":"0"}},{"type":"query-stats","rowCount":1,"runTimeMilliseconds":14}] [16:31:11] Which looks correct [16:31:18] gerrit-test3 [16:31:24] Uploading a patch now [16:31:31] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2706067 (10Joe) >>! In T147718#2705543, @Ottomata wrote: >> Also, having modules/profiles is what everyone is doing. > Ah interesting, I thought we were... [16:32:09] paladox: ... interesting .. 
i mean the "how did it change" part [16:32:21] Not sure though [16:33:26] (03PS1) 10Paladox: Gerrit: Fix reviewer-counts.json cronspam by removing \ and " [puppet] - 10https://gerrit.wikimedia.org/r/315300 (https://phabricator.wikimedia.org/T147776) [16:33:27] mutante ^^ [16:33:43] ottomata: proceeding with https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka/Administration#Swapping_broken_disk [16:33:46] ok? [16:35:27] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed with no logs - https://phabricator.wikimedia.org/T147769#2706079 (10jcrespo) From the IDRAC 8 web console: ``` Log: Normal Mon Feb 08 2016 16:08:44 Log cleared. Critical Mon Oct 10 2016 03:52:20 CPU 1 has an internal error (IERR). Lifec... [16:35:33] elukey: ok! [16:35:34] (03PS1) 10Paladox: Gerrit: Also list mediawiki skins [puppet] - 10https://gerrit.wikimedia.org/r/315301 [16:36:17] (03PS1) 10Andrew Bogott: Rename role::labs::tools::* to role::toollabs::* [puppet] - 10https://gerrit.wikimedia.org/r/315302 [16:37:33] (03PS2) 10Paladox: Gerrit: Fix reviewer-counts.json cronspam by removing \ and " [puppet] - 10https://gerrit.wikimedia.org/r/315300 (https://phabricator.wikimedia.org/T147776) [16:37:48] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2706093 (10jcrespo) [16:38:13] (03PS2) 10Paladox: Gerrit: Also list mediawiki skins [puppet] - 10https://gerrit.wikimedia.org/r/315301 [16:38:43] (03CR) 10Ori.livneh: [C: 032] Refactor memcached role to allow a more flexible hieradata config [puppet] - 10https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [16:38:49] oops [16:39:01] (03CR) 10Ori.livneh: [C: 031] "Looks good. Haven't tested." 
[puppet] - 10https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [16:39:14] I do not know the context, but \"' seems like too much quoting [16:39:29] Yep [16:41:41] (03PS1) 10Giuseppe Lavagetto: etcd::auth::common: always install etcd-manage [puppet] - 10https://gerrit.wikimedia.org/r/315303 [16:42:06] yes, that part seems obvious, i just dont get how it ever worked before on lead and started now [16:42:09] oh well [16:42:29] not worth it, will just merge the fix, yea [16:42:52] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd::auth::common: always install etcd-manage [puppet] - 10https://gerrit.wikimedia.org/r/315303 (owner: 10Giuseppe Lavagetto) [16:43:01] (03PS2) 10Giuseppe Lavagetto: etcd::auth::common: always install etcd-manage [puppet] - 10https://gerrit.wikimedia.org/r/315303 [16:43:03] (03PS3) 10Paladox: Gerrit: Fix reviewer-counts.json cronspam by removing \ and " [puppet] - 10https://gerrit.wikimedia.org/r/315300 (https://phabricator.wikimedia.org/T147776) [16:43:19] (03CR) 10Giuseppe Lavagetto: [V: 032] etcd::auth::common: always install etcd-manage [puppet] - 10https://gerrit.wikimedia.org/r/315303 (owner: 10Giuseppe Lavagetto) [16:43:52] (03CR) 10Paladox: "Reverted in https://gerrit.wikimedia.org/r/#/c/315300/" [puppet] - 10https://gerrit.wikimedia.org/r/315296 (https://phabricator.wikimedia.org/T147776) (owner: 10Dzahn) [16:43:56] <_joe_> mutante: ok to merge your change? [16:44:22] <_joe_> I guess so [16:44:26] <_joe_> it disables a cron [16:44:38] 06Operations, 10Gerrit, 13Patch-For-Review: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2706106 (10Paladox) >>! In T147776#2706030, @Dzahn wrote: > Why the syntax error? We just moved gerrit, we did not upgrade the version. If anything i'd expect a permissions... [16:45:51] ottomata: mmm would you mind to check with me what is the disk to apply the partition to? 
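Note on the reviewer-counts quoting issue discussed above: the MariaDB error fragment begins with `'SELECT`, which suggests that a literal quote character reached the SQL text rather than being consumed as shell quoting. The sketch below uses Python's `shlex` module to approximate POSIX shell word splitting. The commands are hypothetical, shortened stand-ins (`gsql -c ... SELECT 1`), not the real cron entry, and the log does not show how the puppetized command was actually rendered, so this only illustrates the general mechanism.

```python
import shlex

# Hypothetical, shortened stand-ins for the cron command in the log.
# BROKEN escapes the quotes, so they become literal characters.
# FIXED uses plain single quotes, which the shell consumes.
BROKEN = r"gsql -c \'SELECT 1\'"
FIXED = "gsql -c 'SELECT 1'"


def split_like_shell(command):
    """Tokenise a command using POSIX quoting rules.

    This covers quoting and escaping only; it performs no variable
    expansion and does not interpret redirections.
    """
    return shlex.split(command, posix=True)


broken = split_like_shell(BROKEN)
fixed = split_like_shell(FIXED)

# In the broken form the query is split on whitespace and keeps
# the literal quote characters.
print(broken)
# In the fixed form the whole query is one argument with no quotes.
print(fixed)
```

With the escaped quotes the query no longer arrives as a single clean argument, which is consistent with the "too much quoting" diagnosis in the channel; dropping the backslashes restores a single quoted argument.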
[16:46:12] 06Operations, 10hardware-requests: EQIAD: (2) hardware access request for PUPPET - https://phabricator.wikimedia.org/T142218#2706112 (10RobH) [16:46:56] _joe_: yes, please [16:47:27] _joe_: sorry, i was about to merge another one that includes a revert [16:47:33] but it should be yes [16:47:59] ottomata: ah maybe sdi [16:48:04] it is mounted but not readable [16:48:05] weird [16:48:14] elukey: sorry [16:48:24] only half following (was eating lunch) [16:48:31] cmjohnson1: has swapped? [16:48:51] ottomata: yes swapped [16:49:50] hmmm [16:50:04] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2706131 (10Papaul) Enterprise Service Request Hello Papaun, Thank you for contacting Dell! This issue has bee... [16:50:04] elukey: gonna try to unmount i [16:50:10] unless you are doing stuff [16:50:21] ottomata: I was about to do it, I only need to create the partition [16:50:27] (03CR) 10Yuvipanda: [C: 031] "+1 for moving! I eventually want us to use yaml rather than json for the info (json is not very human editable and has no comments) but no" [puppet] - 10https://gerrit.wikimedia.org/r/314773 (owner: 10Rush) [16:50:29] elukey: proceed [16:50:43] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2706134 (10Papaul) BIOS: 2.2.5 http://downloads.dell.com/FOLDER03917193M/1/BIOS_PFWCY_WN32_2.2.5.EXE iDRAC-L... 
[16:50:45] i think i was never unmounted during hte disk swap, because its funky funky [16:51:09] (03PS7) 10Rush: labsdb: puppetize maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/314773 [16:52:46] elukey: lemme know if i can help [16:54:31] 06Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2706173 (10fgiunchedi) copper reinstalled with SSD, I've saved and restored /home and /var/cache/pbuilder/result, testing a package build now [16:55:10] ottomata: yeah if you want to check because fdisk does not like me now [16:55:17] (03CR) 10Rush: [C: 032] labsdb: puppetize maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/314773 (owner: 10Rush) [16:55:22] (03PS1) 10Giuseppe Lavagetto: role::etcd::common: fix auth if inactive [puppet] - 10https://gerrit.wikimedia.org/r/315306 [16:55:27] 06Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2706174 (10fgiunchedi) 05Open>03Resolved package building works, resolving [16:56:17] godog: \o/ [16:57:03] \o/ indeed! [16:57:17] now limited by cpu and dpkg speed, hopefully [16:59:12] (03CR) 10Giuseppe Lavagetto: [C: 032] role::etcd::common: fix auth if inactive [puppet] - 10https://gerrit.wikimedia.org/r/315306 (owner: 10Giuseppe Lavagetto) [16:59:17] (03PS2) 10Giuseppe Lavagetto: role::etcd::common: fix auth if inactive [puppet] - 10https://gerrit.wikimedia.org/r/315306 [16:59:20] (03CR) 10Giuseppe Lavagetto: [V: 032] role::etcd::common: fix auth if inactive [puppet] - 10https://gerrit.wikimedia.org/r/315306 (owner: 10Giuseppe Lavagetto) [17:00:05] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161011T1700). 
[17:00:28] probably not today [17:01:16] I need to run a maintenance script in wikis [17:01:24] for https://phabricator.wikimedia.org/T145356 [17:01:51] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:03:00] 06Operations, 10ops-eqiad: Investigate strontium disk issues on 2016-08-05 - https://phabricator.wikimedia.org/T142187#2706202 (10akosiaris) @Cmjohnson Let's just decommision it [17:03:23] RECOVERY - puppet last run on etcd1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:03:38] 06Operations, 10ops-eqiad: Decommission strontium - https://phabricator.wikimedia.org/T142722#2706203 (10akosiaris) @Cmjohnson: yup, sounds fine. +1 [17:04:41] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:05:06] !log starting branch cut for 1.28.0-wmf.22 [17:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:06:01] well here we go boys/girls [17:09:41] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:10:46] looking at logstash, mw1272 is having issues: https://logstash.wikimedia.org/goto/d9f8d9aad5b397feb09998ca6927a7c1 [17:10:57] 7k fatals in the last hour [17:12:04] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:12:18] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:13:56] greg-g: confirmed, rebooting it [17:13:59] (in meeting though) [17:14:23] !log mw1272 reboot [17:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:14:43] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [17:16:16] !log rebooting kafka1018 [17:16:18] PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100% [17:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:17:31] RECOVERY - Host mw1272 is UP: PING OK - Packet loss = 0%, RTA = 1.72 ms [17:17:42] thanks mutante [17:18:27] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2706272 (10Joe) So it was decided in the TechOps meeting to put this on stall and use the dockerproject.org package to unblock the res... [17:18:35] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Docker installation for production kubernetes - https://phabricator.wikimedia.org/T147181#2706277 (10Joe) [17:18:37] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2706276 (10Joe) 05Open>03stalled [17:18:43] what the heck is this about (there a few of these in fatalmonitor): https://logstash.wikimedia.org/goto/4c212c9df75ca0d89efa4bc9b5eed29c (low prio curiousity, probably) [17:18:53] greg-g: looks like it's ok now and was kernel/hhvm. 
kernel: [10254956.128727] BUG: Bad page map in process hhvm [17:19:08] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2691932 (10Joe) a:05Joe>03None [17:20:05] !log mw1272 kernel: [10254957.470558] BUG: Bad page map in process hhvm [17:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:22:34] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:25:03] where should I report redis timeout fatals (other than #wikimedia-log-errors)? https://phabricator.wikimedia.org/T147866 [17:25:13] !log ladsgroup@terbium:~$ mwscript extensions/ORES/maintenance/CleanDuplicateScores.php on eight wikis (T145356) [17:25:13] it's spamming the hell out of fatalmonitor [17:25:14] T145356: Ensure ORES data violating constraints do not affect production - https://phabricator.wikimedia.org/T145356 [17:25:15] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [17:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:08] 06Operations, 10Gerrit, 13Patch-For-Review: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2706315 (10Dzahn) Yes, but how did it ever work before and now suddenly fail on both, old and new server? Or was it broken all the time and only noticed now by coincidence? [17:27:14] 06Operations, 10ChangeProp, 06Services, 15User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2705503 (10akosiaris) ``` nor does starting it on scb2001 ``` That's interesting. Any idea why ? 
[17:27:15] RECOVERY - puppet last run on etcd1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:27:20] (03PS4) 10Paladox: Gerrit: Fix reviewer-counts.json cronspam by removing \ and " [puppet] - 10https://gerrit.wikimedia.org/r/315300 (https://phabricator.wikimedia.org/T147776) [17:27:45] (03PS1) 10RobH: smalyshev access to restricted usergroup [puppet] - 10https://gerrit.wikimedia.org/r/315308 (https://phabricator.wikimedia.org/T147666) [17:28:03] 06Operations, 10Gerrit, 13Patch-For-Review: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2706327 (10Paladox) @Dzahn probably it was broken all the time. Just wasn't noticed until now. [17:29:21] !log starting mobileapps deploy [17:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:34] RECOVERY - puppet last run on etcd1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:33:01] Going to https://wikimedia.biterg.io/app/kibana#/dashboard/Gerrit?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-2y,mode:quick,to:now))&_a=(filters:!(),options:(darkTheme:!f),panels:!((col:1,id:Gerrit-metrics__gerrit_enrich,panelIndex:2,row:1,size_x:1,size_y:4,title:Gerrit,type:visualization),(col:1,id:Patchsets-per-review__gerrit_enrich,panelIndex:4,row:9,size_x:3,size_y:3,title:'Patchsets%20Statistics%20Per%20Review', [17:33:01] type:visualization),(col:4,id:Time-per-review__gerrit_enrich,panelIndex:5,row:9,size_x:3,size_y:3,title:'Changesets%20Statistics%20(Open%20Time)',type:visualization),(col:2,id:Reviews-by-opening-time__gerrit_enrich,panelIndex:7,row:1,size_x:5,size_y:2,title:'Changesets%20Per%20Status',type:visualization),(col:7,id:Patchsets-per-review-per-month__gerrit_enrich,panelIndex:9,row:9,size_x:6,size_y:3,title:'Patchsets%20Per%20Review',type:visua [17:33:02] 
lization),(col:2,id:Change-submitters-per-month__gerrit_enrich,panelIndex:11,row:3,size_x:5,size_y:2,title:'Changeset%20Submitters',type:visualization),(col:7,id:Organizations-pie__gerrit_eclipse_enrich,panelIndex:15,row:1,size_x:3,size_y:4,title:Organizations,type:visualization),(col:1,id:gerrit_top_developers,panelIndex:17,row:5,size_x:6,size_y:4,title:Submitters,type:visualization),(col:7,id:gerrit_evolution_organizations,panelIndex:18 [17:33:03] (03PS1) 10Rush: maintain-replicas: no longer kept here [software] - 10https://gerrit.wikimedia.org/r/315311 [17:33:07] ,row:5,size_x:6,size_y:4,title:Organizations,type:visualization),(col:10,id:gerrit_repositories_table,panelIndex:19,row:1,size_x:3,size_y:4,title:Repositories,type:visualization)),query:(query_string:(analyze_wildcard:!t,query:'*')),title:Gerrit,uiState:(P-11:(title:'Changeset%20Submitters'),P-15:(title:Organizations),P-17:(title:Submitters),P-18:(title:Organizations),P-19:(title:Repositories),P-2:(title:Gerrit),P-4:(title:'Patchsets%20St [17:33:12] atistics%20Per%20Review'),P-5:(title:'Changesets%20Statistics%20(Open%20Time)'),P-7:(title:'Changesets%20Per%20Status',vis:(legendOpen:!f)),P-9:(title:'Patchsets%20Per%20Review'),title:Submitters)) [17:33:15] Woopos [17:33:17] Thats a big url [17:33:20] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [17:33:21] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:33:21] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:33:21] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:33:23] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [17:33:23] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [17:33:23] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [17:33:28] https://wikimedia.biterg.io/app/kibana#/dashboard/Gerrit [17:33:32] ah snap but I put downtime [17:33:35] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:33:35] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:33:35] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:33:35] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:33:35] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:33:35] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:33:35] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:33:36] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:33:38] I am working on kafka1018 [17:33:39] Going to that im getting Could not locate that index-pattern-field (id: project) [17:33:43] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:33:43] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:33:43] PROBLEM - IPsec on cp3037 is 
CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:33:44] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [17:33:44] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [17:33:57] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [17:33:58] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [17:34:03] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [17:34:04] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [17:34:04] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [17:34:04] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [17:34:05] sorry sorry sorry [17:34:08] :) [17:34:14] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:34:14] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:34:14] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:34:14] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:34:23] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [17:34:23] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [17:34:23] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [17:34:24] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:34:24] PROBLEM - IPsec on cp4016 is 
CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:34:24] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:34:25] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:34:34] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [17:34:35] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:34:35] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:34:35] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:34:35] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:34:35] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:34:35] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [17:34:35] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [17:34:52] * _joe_ stones elukey [17:34:57] ahahah [17:35:05] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:35:05] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:35:05] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:35:05] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:35:05] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:35:05] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:35:06] PROBLEM - IPsec on 
cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [17:35:06] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [17:35:10] oops [17:35:13] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [17:35:14] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [17:35:14] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [17:35:15] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:35:15] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:35:15] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:35:17] did i press the red button again [17:35:23] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:35:23] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:35:24] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:35:24] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:35:24] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:35:24] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:35:24] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:35:25] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:35:25] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:35:26] 
PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [17:35:34] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [17:35:38] * volans stabs elukey :-P [17:35:52] <_joe_> Zppix: DID YOU? [17:35:54] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [17:35:54] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [17:35:54] <_joe_> :P [17:35:54] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:35:54] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [17:35:55] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [17:36:05] 06Operations, 10ChangeProp, 06Services, 15User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2706367 (10Pchelolo) >>! In T147849#2706316, @akosiaris wrote: > ``` > nor does starting it on scb2001 > ``` > > That's interesting. Any idea why ? One possible differenc... [17:36:08] <_joe_> Zppix: I know it seems all horrible, but it's just one server rebooting [17:36:51] it's all just kafka1018 [17:36:59] yep [17:37:05] (at the end of the lines) [17:37:21] paladox: that's not something from us (though they are working on it for us, I think through the Developer Relations team), best to ask biterg.io directly [17:37:31] joe i figured i was kidding [17:37:34] Oh [17:37:36] paladox: see the domain name isn't ours :) [17:37:36] ok [17:37:40] yep [17:37:51] greg-g but http://korma.wmflabs.org/browser/scr-contributors.html is? 
[17:37:57] Which takes us to that new domain [17:38:42] paladox: ask in #wikimedia-devrel [17:38:51] Ok [17:38:56] Thanks [17:38:58] paladox: this is not something that production/ops are associated with [17:40:02] (03CR) 10Rush: [C: 032] maintain-replicas: no longer kept here [software] - 10https://gerrit.wikimedia.org/r/315311 (owner: 10Rush) [17:40:20] oh ok [17:42:37] (03CR) 10Volans: "Can you remove the reference from tox.ini too please? :)" [software] - 10https://gerrit.wikimedia.org/r/315311 (owner: 10Rush) [17:44:26] (03PS1) 10Jdlrobson: Disable bottom language button in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315314 (https://phabricator.wikimedia.org/T143829) [17:44:53] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2706448 (10Papaul) Today october 11th I call Dell Support for this issue. Call time 10:52 am call duration = 54 m... 
[17:46:39] (03PS1) 10Rush: maintain-replicas: remove tox entries [software] - 10https://gerrit.wikimedia.org/r/315317 [17:46:50] jouncebot: next [17:46:50] In 0 hour(s) and 13 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161011T1800) [17:47:29] (03CR) 10Rush: [C: 032 V: 032] maintain-replicas: remove tox entries [software] - 10https://gerrit.wikimedia.org/r/315317 (owner: 10Rush) [17:51:38] !log deployed mobileapps fc900fc [17:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:51:47] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [17:57:55] (03PS1) 10Rush: labsdb: apply candidate puppet logic for maintain-views to labsdb1008 [puppet] - 10https://gerrit.wikimedia.org/r/315319 [17:59:32] (03Abandoned) 10RobH: ssl cert renewals: ldap-[codfw|eqiad].wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/309592 (https://phabricator.wikimedia.org/T145201) (owner: 10RobH) [17:59:41] (03CR) 10RobH: [C: 032] smalyshev access to restricted usergroup [puppet] - 10https://gerrit.wikimedia.org/r/315308 (https://phabricator.wikimedia.org/T147666) (owner: 10RobH) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161011T1800). [18:00:04] James_F and RoanKattouw: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:17] Heya. [18:00:25] (Roan's here too.) [18:00:57] 06Operations, 10Ops-Access-Requests: Requesting access to terbium for smalyshev (restricted group) - https://phabricator.wikimedia.org/T147666#2706540 (10RobH) 05stalled>03Resolved @Smalyshev This was approved in the ops meeting today, so I've merged your access to the cluster. 
It may take up to 30 minu... [18:01:03] * RoanKattouw waves [18:01:05] 06Operations, 10Ops-Access-Requests: Requesting access to terbium for smalyshev (restricted group) - https://phabricator.wikimedia.org/T147666#2706543 (10RobH) a:05RobH>03None [18:01:15] Can't do the deployment myself but I can be here for my patch [18:01:33] I can SWAT today [18:02:41] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315277 (https://phabricator.wikimedia.org/T142589) (owner: 10Jforrester) [18:03:17] (03Merged) 10jenkins-bot: Enable the visual editor for logged-in users on remaining phase 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315277 (https://phabricator.wikimedia.org/T142589) (owner: 10Jforrester) [18:04:15] 06Operations: Rename rhodium to puppetmaster1003 - https://phabricator.wikimedia.org/T147872#2706559 (10akosiaris) [18:06:10] James_F: your change should be live on mw1099 [18:06:31] * James_F tests. [18:07:14] (03PS2) 10Thcipriani: Enable $wgPageTriageNoIndexUnreviewedNewArticles on all wikis that have PageTriage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314643 (https://phabricator.wikimedia.org/T147544) (owner: 10Catrope) [18:07:43] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:07:48] thcipriani: Yup, LGTM. [18:07:58] James_F: IS then the dblist? [18:08:11] for sync order [18:08:11] thcipriani: Yes please. 
[18:08:17] okie doke, going everywhere [18:09:57] !log T133395: Restarting xenon.eqiad.wmnet to apply https://gerrit.wikimedia.org/r/314603 [18:09:58] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [18:10:20] (03PS5) 10Dzahn: Gerrit: Fix reviewer-counts.json cronspam by removing \ and " [puppet] - 10https://gerrit.wikimedia.org/r/315300 (https://phabricator.wikimedia.org/T147776) (owner: 10Paladox) [18:10:35] !log updated php on iridium [18:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:49] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:315277|Enable the visual editor for logged-in users on remaining phase 6 Wikipedias (T142589)]] PART I (duration: 01m 56s) [18:11:50] T142589: Enable VisualEditor by default for all users of all remaining non-language variant Wikipedias - https://phabricator.wikimedia.org/T142589 [18:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:10] !log thcipriani@mira Synchronized dblists/visualeditor-nondefault.dblist: SWAT: [[gerrit:315277|Enable the visual editor for logged-in users on remaining phase 6 Wikipedias (T142589)]] PART II (duration: 00m 59s) [18:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:15] ^ James_F live everywhere [18:13:31] Thanks! 
[18:13:34] !log T133395: Restarting Cassandra in RESTBase Staging to apply https://gerrit.wikimedia.org/r/314603 [18:13:37] (03CR) 10Dzahn: [C: 032] Gerrit: Fix reviewer-counts.json cronspam by removing \ and " [puppet] - 10https://gerrit.wikimedia.org/r/315300 (https://phabricator.wikimedia.org/T147776) (owner: 10Paladox) [18:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:40] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314643 (https://phabricator.wikimedia.org/T147544) (owner: 10Catrope) [18:13:50] mutante ^ thanks :) [18:14:05] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:14:17] (03Merged) 10jenkins-bot: Enable $wgPageTriageNoIndexUnreviewedNewArticles on all wikis that have PageTriage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314643 (https://phabricator.wikimedia.org/T147544) (owner: 10Catrope) [18:14:43] paladox: i wonder if korma is still reading it .. [18:14:54] (would be nice if it kept updating though) [18:15:02] Oh not sure, could be though [18:15:07] Yep [18:15:13] yea, finding that out is more work tahn fixing it, heh [18:15:27] something ..external contractor [18:15:27] RoanKattouw: your change is live on mw1099 [18:15:31] mutante would you be able to manually get the cron to run so we can see if it works for it? [18:15:33] please [18:15:38] https://gerrit.wikimedia.org/reviewer-counts.json [18:15:40] paladox: yea, in a minute [18:15:47] Ok thanks [18:15:48] :) [18:15:48] thcipriani: Thanks. Do we have wmf22 anywhere yet? [18:16:11] RoanKattouw: not yet, just cut, will deploy to group0 after SWAT [18:16:17] OK cool [18:16:31] Then I can't test this for real yet, will look at the $wg through eval.php though [18:16:44] eval.php checks out [18:17:01] So looks good to me, insofar as it's verifiable right now [18:17:02] okie doke. 
Looks like IS.php then CS.php for this sync? [18:17:33] !log lead (old gerrit) manually remove reviewer-count cron, puppet is disabled [18:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:17:42] paladox: ^ because puppet is not running there [18:17:59] Oh [18:18:04] now to cobalt [18:18:09] :) [18:18:12] thanks [18:18:47] !log cobalt (new gerrit) run reviewer-count cron, works now [18:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:54] paladox: open it [18:19:05] Ok [18:19:08] thanks [18:19:19] Still shows the error [18:19:26] so i suppose it is still updating [18:19:30] paladox: browser cache? [18:19:42] can't confirm when looking at the file on server [18:19:45] Yep [18:19:49] Browser cache [18:19:50] it works [18:19:53] 'k :) [18:20:03] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:314643|Enable $wgPageTriageNoIndexUnreviewedNewArticles on all wikis that have PageTriage (T147544)]] PART I (duration: 00m 50s) [18:20:04] T147544: Unreviewed new articles on English Wikipedia should be marked as noindex - https://phabricator.wikimedia.org/T147544 [18:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:20:42] 06Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2706670 (10Dzahn) [18:20:43] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [18:20:44] 06Operations, 10Gerrit, 13Patch-For-Review: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2706666 (10Dzahn) 05Open>03Resolved a:03Dzahn 11:18 < mutante> !log lead (old gerrit) manually remove reviewer-count cron, puppet is disabled 11:19 < mutante> !log co... [18:21:03] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:21:31] mutante the file is now 20.3mb [18:22:47] paladox: is that unusual? [18:22:55] No [18:23:01] Just confirming it works :) [18:23:06] heh,ok [18:25:37] 06Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2412705 (10Dzahn) [18:25:40] 06Operations, 06Security-Team, 10vm-requests: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#2706705 (10Dzahn) 05stalled>03declined Closing this as the related access request T138873 has been declined. Should be reopened together with that. [18:26:25] RoanKattouw: I halted the sync of CommonSettings.php. I started to see undefined variable notice on canaries: https://logstash.wikimedia.org/goto/1b70fc9b39d98a8d3e6440763ac8e72e but I have no idea why that would happen. [18:26:38] !log T133395: Starting dumps (3) in RESTBase Staging [18:26:39] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [18:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:26:44] 06Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2706716 (10Dzahn) we worked with it during the offsite. Moritz added a script to write the first 5 test keys and then we did. now some more testing is going to happen, and meanwhile my task here is to bring the laptop ba... [18:27:45] RoanKattouw: there was a momentary spike of those errors, but it seems it was somehow ephemeral. I can't think of a reason why that would happen. Should I continue syncing CommonSettings.php? 
[18:29:21] (03CR) 10Rush: [C: 032] labsdb: apply candidate puppet logic for maintain-views to labsdb1008 [puppet] - 10https://gerrit.wikimedia.org/r/315319 (owner: 10Rush) [18:29:25] (03PS2) 10Rush: labsdb: apply candidate puppet logic for maintain-views to labsdb1008 [puppet] - 10https://gerrit.wikimedia.org/r/315319 [18:29:30] (03CR) 10Rush: [V: 032] labsdb: apply candidate puppet logic for maintain-views to labsdb1008 [puppet] - 10https://gerrit.wikimedia.org/r/315319 (owner: 10Rush) [18:30:17] thcipriani: It probably synced the files in the wrong order [18:30:39] InitialiseSettings.php creates a var that CommonSettings.php uses [18:30:46] RoanKattouw: that's what's strange, I sync'd out InitialiseSettings.php before I sync'd CommonSettings.php [18:30:55] So Init should have been synced before Common [18:30:58] Hah, really? [18:31:12] Then I don't understand why you'd get those errors at all [18:31:14] What were the errors? [18:31:17] And you say they stopped? [18:31:35] I halted deployment while it was waiting for the canary check [18:31:39] since I saw the spike [18:31:52] RoanKattouw: https://logstash.wikimedia.org/goto/1b70fc9b39d98a8d3e6440763ac8e72e [18:32:25] I would have expected the errors to continue if there were actually a problem, but I still have no idea why those errors happened. [18:33:44] !log T133395: Restarting Cassandra in RESTBase (codfw) to apply https://gerrit.wikimedia.org/r/314603 [18:33:46] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [18:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:43] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:37:31] 06Operations, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2706738 (10RobH) 05Open>03Resolved I got it refunded back to my card. 
[18:37:47] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:40:35] !log thcipriani@mira Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:314643|Enable $wgPageTriageNoIndexUnreviewedNewArticles on all wikis that have PageTriage (T147544)]] PART II (duration: 00m 52s) [18:40:36] T147544: Unreviewed new articles on English Wikipedia should be marked as noindex - https://phabricator.wikimedia.org/T147544 [18:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:40:54] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [18:40:54] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [18:40:54] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [18:41:03] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [18:41:03] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 28 ESP OK [18:41:03] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 28 ESP OK [18:41:03] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 28 ESP OK [18:41:03] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 28 ESP OK [18:41:03] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [18:41:03] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 44 ESP OK [18:41:04] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [18:41:05] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [18:41:05] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 44 ESP OK [18:41:05] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 44 ESP OK [18:41:06] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 28 ESP OK [18:41:43] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 36 ESP OK [18:41:43] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [18:41:44] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [18:41:44] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 28 ESP OK [18:41:45] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 36 
ESP OK [18:41:45] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [18:41:55] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [18:41:55] the language called "Olo" has language code "ong", the language that has code "olo" is not Olo :p [18:41:59] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [18:41:59] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [18:41:59] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 44 ESP OK [18:42:00] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 44 ESP OK [18:42:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [18:42:09] mutante it's https://en.wikipedia.org/wiki/Livvi-Karelian_language [18:42:10] as you may have noticed kafka 1018 is up again but we might need to reboot again [18:42:14] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 44 ESP OK [18:42:14] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 44 ESP OK [18:42:14] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 28 ESP OK [18:42:14] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 44 ESP OK [18:42:14] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK [18:42:14] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 44 ESP OK [18:42:14] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [18:42:15] still working on it [18:42:15] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [18:42:15] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [18:42:16] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [18:42:16] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [18:42:17] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [18:42:20] paladox: yes [18:42:22] mutante, ololo [18:42:27] hehe [18:42:29] LOL [18:42:35] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 36 ESP OK [18:42:35] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [18:42:35] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [18:42:42] 
MaxSem: i need the name of that language in the language itself [18:42:43] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [18:42:43] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [18:42:43] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [18:42:43] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [18:42:51] MaxSem: should i use Cyrillic? [18:42:53] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 36 ESP OK [18:42:53] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [18:42:53] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 36 ESP OK [18:42:55] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [18:43:04] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 44 ESP OK [18:43:04] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [18:43:04] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 28 ESP OK [18:43:04] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [18:43:04] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 36 ESP OK [18:43:04] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [18:43:05] RECOVERY - IPsec on cp4003 is OK: Strongswan OK - 28 ESP OK [18:43:05] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [18:43:05] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK [18:43:06] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 28 ESP OK [18:43:06] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [18:43:27] mutante, Olo or olo? 
[18:43:54] MaxSem: olo, Livvi-Karelian [18:44:21] the translation of Livvi-Karelian in Livvi-Karelian [18:44:49] ливвиковский язык says it's Russian [18:44:49] MaxSem there's a language called olo but different lang code, and the above with the olo lang code [18:44:51] KLOL [18:44:53] LOL [18:44:56] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:45:07] Why didn't they give the olo code to the olo language [18:45:08] hmm, English WP says it prolly uses Cyrillic, Russian disagrees and says Latin [18:45:24] https://en.wikipedia.org/wiki/Olo_language [18:45:25] Writing System says "citation needed" heh [18:45:25] rebooting again kafka1018 [18:45:31] there might be some turbulence [18:45:43] but I don't know how to shut the IPsec alarms [18:46:37] elukey: FTFY [18:47:51] mutante: the gerrit cronspam?? I was reading the updates, thanks! [18:47:55] ru: says the self-name is livvin kieli, however that sounds suspiciously Finnish - what does Nikerabbit think? [18:48:02] elukey: the icinga bot :) [18:48:29] i'll just make it join again when you're done [18:48:36] mutante the 20.3mb when i try viewing it, it freezes my Notepad++ and sublime text too LOL [18:48:39] Must be really big [18:48:52] but then again it caused the browser to when i was trying to paste it in [18:49:05] paladox: wget? [18:49:19] what are you going to do with it [18:49:27] wget is not available on windows 10 [18:49:39] I downloaded the file when i went to the site [18:49:42] paladox: thought you have bash and ubuntu on windows [18:49:45] it immediately started downloading [18:49:46] MaxSem: how does my opinion matter...
the language is very similar to Finnish [18:49:46] Yes [18:50:35] https://fi.wikipedia.org/wiki/Aunuksenkarjalan_kieli [18:50:55] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2706765 (10RobH) a:05jcrespo>03Cmjohnson Chris, I'll escalate this to our account team, but can you dispatch ov... [18:51:02] not to be confused with https://fi.wikipedia.org/wiki/Liivin_kieli [18:51:46] * paladox does not speak finnish [18:51:50] http://www-01.sil.org/iso639-3/documentation.asp?id=olo says just "Livvi" [18:52:17] would expect ISO 639-3 to have local names too, hrmmm [18:52:26] MaxSem: i'll use "livvin kieli" i gues [18:52:40] where is that name needed? [18:52:49] statistics tables [18:53:08] https://github.com/wikimedia/jquery.uls/blob/master/data/langdb.yaml#L368 has yet different name [18:53:48] heh [18:53:53] mutante: aaahhh okok! kafka1018 is up [18:54:02] sorry I didn't get :) [18:54:07] I can restart irc-eco [18:54:38] elukey: ah, done [18:54:48] super! [18:55:02] !log kafka1018 back in service after maintenance [18:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:23] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:55:37] MaxSem: Nikerabbit: thanks, we can always change it once we find a native speaker, heh [18:57:14] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:31] ^^^ on that. 
[18:58:03] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:59:31] !log restarting restbase: restbase2004.codfw.wmnet [18:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161011T1900). [19:00:34] * thcipriani does [19:01:51] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [19:03:25] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [19:05:24] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [19:06:19] !log thcipriani@mira Started scap: testwiki to 1.28.0-wmf.22 and rebuild l10n cache [19:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:35] 06Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 13Patch-For-Review, 07Regression: Favicon broken on doc.wikimedia.org and integration.wikimedia.org (HTTP 500) - https://phabricator.wikimedia.org/T147814#2706944 (10Krinkle) 05Open>03Resolved a:03Krinkle [19:11:29] (03PS1) 10Rush: labsdb: move init to labsdb::views [puppet] - 10https://gerrit.wikimedia.org/r/315327 [19:11:58] (03PS2) 10Rush: labsdb: move init to labsdb::views [puppet] - 10https://gerrit.wikimedia.org/r/315327 [19:15:06] (03CR) 10Yuvipanda: [C: 031] labsdb: move init to labsdb::views [puppet] - 10https://gerrit.wikimedia.org/r/315327 (owner: 10Rush) [19:15:22] (03CR) 10Rush: [C: 032] labsdb: move init to labsdb::views [puppet] - 10https://gerrit.wikimedia.org/r/315327 (owner: 10Rush) [19:19:08] (03PS1) 10Yuvipanda: notebook: Provision researcher acceounts on notebook servers [puppet] - 10https://gerrit.wikimedia.org/r/315329 [19:19:10] (03PS1) 
10Yuvipanda: authdns: Move roles to module [puppet] - 10https://gerrit.wikimedia.org/r/315330 [19:20:04] (03PS1) 10Rush: labsdb: split off labsdb1008 node def [puppet] - 10https://gerrit.wikimedia.org/r/315331 [19:20:15] (03PS2) 10Rush: labsdb: split off labsdb1008 node def [puppet] - 10https://gerrit.wikimedia.org/r/315331 [19:20:54] (03CR) 10Yuvipanda: [C: 031] labsdb: split off labsdb1008 node def [puppet] - 10https://gerrit.wikimedia.org/r/315331 (owner: 10Rush) [19:21:36] (03CR) 10Rush: [C: 032] labsdb: split off labsdb1008 node def [puppet] - 10https://gerrit.wikimedia.org/r/315331 (owner: 10Rush) [19:21:41] (03PS2) 10Yuvipanda: authdns: Move roles to module [puppet] - 10https://gerrit.wikimedia.org/r/315330 [19:22:56] mutante: ^ I'm going to spend like an hour doing some cleanup in manifests/role too [19:24:22] (03CR) 10Yuvipanda: [C: 032] authdns: Move roles to module [puppet] - 10https://gerrit.wikimedia.org/r/315330 (owner: 10Yuvipanda) [19:24:28] (03CR) 10Yuvipanda: "https://puppet-compiler.wmflabs.org/4298/ noop says pc" [puppet] - 10https://gerrit.wikimedia.org/r/315330 (owner: 10Yuvipanda) [19:24:32] (03CR) 10Volans: "Faidon: I've see those failures on swift machines with 14 disks, when the load is very close to the number of cores, so the hpssacli calls" [puppet] - 10https://gerrit.wikimedia.org/r/315103 (owner: 10Filippo Giunchedi) [19:24:34] (03PS3) 10Yuvipanda: authdns: Move roles to module [puppet] - 10https://gerrit.wikimedia.org/r/315330 [19:24:39] (03CR) 10Yuvipanda: [V: 032] authdns: Move roles to module [puppet] - 10https://gerrit.wikimedia.org/r/315330 (owner: 10Yuvipanda) [19:26:50] (03PS1) 10Yuvipanda: Move role::puppet::self into role module [puppet] - 10https://gerrit.wikimedia.org/r/315333 [19:27:18] yuvitest: cool :) [19:28:04] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2707027 (10Dzahn) mailed the ops list with a summary of this. 
comments welcome [19:28:06] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2707028 (10Dzahn) mailed the ops list with a summary of this. comments welcome [19:28:14] mutante: when / if you have time, do you wanna look at role::mariadb? :D biggest one left, I think [19:28:20] I'll probably be able to finish all the others today [19:28:23] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:28:43] and when that's all done, we can then move all of manifests/role into modules/role/manifets and call it done :) [19:28:50] (03PS2) 10Yuvipanda: Move role::puppet::self into role module [puppet] - 10https://gerrit.wikimedia.org/r/315333 [19:28:54] (03PS1) 10Rush: labsdb: move heartbeat-views.sql to files [puppet] - 10https://gerrit.wikimedia.org/r/315334 [19:28:56] (03CR) 10Yuvipanda: [C: 032 V: 032] Move role::puppet::self into role module [puppet] - 10https://gerrit.wikimedia.org/r/315333 (owner: 10Yuvipanda) [19:29:25] (03CR) 10jenkins-bot: [V: 04-1] labsdb: move heartbeat-views.sql to files [puppet] - 10https://gerrit.wikimedia.org/r/315334 (owner: 10Rush) [19:29:46] (03PS2) 10Rush: labsdb: move heartbeat-views.sql to files [puppet] - 10https://gerrit.wikimedia.org/r/315334 [19:30:50] yuvitest: ok, i can make a patch, but i will need reviews from dba people [19:31:03] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:31:18] mutante: since it's a plain move, I feel ok if we run it through puppet compiler and merge if it's all noops [19:31:48] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:04] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:05] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:32:05] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [19:32:15] ok, let's try [19:32:24] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:24] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:35] :/ [19:32:44] (03CR) 10Rush: [C: 032] labsdb: move heartbeat-views.sql to files [puppet] - 10https://gerrit.wikimedia.org/r/315334 (owner: 10Rush) [19:33:05] (03PS1) 10Yuvipanda: Move statistics role into separate files on modules [puppet] - 10https://gerrit.wikimedia.org/r/315335 [19:33:06] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:33:06] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:33:09] yuvitest: I caught two of your merges [19:33:15] Yuvipanda: Move role::puppet::self into role module (ad721ab) [19:33:15] Yuvipanda: authdns: Move roles to module (81f367a) [19:33:17] ok to go? [19:33:20] oh, i see why, there is role::mariadb [19:33:25] (03CR) 10jenkins-bot: [V: 04-1] Move statistics role into separate files on modules [puppet] - 10https://gerrit.wikimedia.org/r/315335 (owner: 10Yuvipanda) [19:33:34] chasemp: whoops, yes [19:33:49] going [19:33:49] chasemp: unrelated, did you see my message on https://gerrit.wikimedia.org/r/#/c/315311/ after merged? 
:) [19:34:10] volans: yes already done I think https://gerrit.wikimedia.org/r/#/c/315317/ [19:34:12] oh sorry already one [19:34:13] unless there is more I missed [19:34:16] just saw the email :D [19:34:28] yuvitest: not sure what to do with role::mariadb as opposed to role::mariadb::foo etc [19:34:36] that was it, sorry, I should have checked emails before asking ;) [19:34:37] thanks [19:34:45] maybe role::mariadb::server [19:34:54] volans: no worries thanks for bringing it up [19:34:54] since that is the role description [19:34:59] (03PS2) 10Yuvipanda: Move statistics role into separate files on modules [puppet] - 10https://gerrit.wikimedia.org/r/315335 [19:35:19] mutante: is it directly being used anywhere? [19:35:27] ACKNOWLEDGEMENT - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) eevans Investigating. [19:35:48] yuvitest: i'll find out, move on with the statistic ones:) [19:35:54] (03CR) 10jenkins-bot: [V: 04-1] Move statistics role into separate files on modules [puppet] - 10https://gerrit.wikimedia.org/r/315335 (owner: 10Yuvipanda) [19:37:00] (03CR) 10Alex Monk: Follow-up Ifa2cc187: Add ShortUrl support on wikimedia.org docroot sites (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/311647 (https://phabricator.wikimedia.org/T146014) (owner: 10Alex Monk) [19:37:15] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: Connection refused [19:38:22] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: Connection refused eevans Restarting. 
[19:39:54] (03PS3) 10Yuvipanda: Move statistics role into separate files on modules [puppet] - 10https://gerrit.wikimedia.org/r/315335 [19:41:03] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [19:41:03] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [19:41:41] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [19:41:41] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [19:42:43] (03PS4) 10Yuvipanda: Move statistics role into separate files on modules [puppet] - 10https://gerrit.wikimedia.org/r/315335 [19:42:44] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [19:42:45] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [19:42:45] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [19:43:07] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [19:43:07] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [19:44:26] (03CR) 10Yuvipanda: [C: 032] Move statistics role into separate files on modules [puppet] - 10https://gerrit.wikimedia.org/r/315335 (owner: 10Yuvipanda) [19:44:31] (03CR) 10Yuvipanda: [V: 032] "https://puppet-compiler.wmflabs.org/4301/ noop" [puppet] - 10https://gerrit.wikimedia.org/r/315335 (owner: 10Yuvipanda) [19:44:56] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on port 9042 [19:48:54] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [19:48:55] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:48:55] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:49:34] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:49:34] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:49:35] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:37] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 500 (expecting: 200) [19:50:45] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:45] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:49] (03PS6) 10Andrew Bogott: Move toollabs node classes to roles. [puppet] - 10https://gerrit.wikimedia.org/r/314180 (https://phabricator.wikimedia.org/T147233) [19:51:04] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [19:51:04] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:51:04] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:51:27] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [19:51:46] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [19:51:48] !log thcipriani@mira Finished scap: testwiki to 1.28.0-wmf.22 and rebuild l10n cache (duration: 45m 28s) [19:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:52:20] (03PS1) 10Yuvipanda: Move logging.pp roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315341 [19:52:29] looking ^^^ [19:53:15] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [19:53:15] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [19:53:39] (03PS1) 10Dzahn: mariadb: split role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) [19:53:45] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [19:53:45] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [19:53:56] yuvitest: ^ wow, so many classes in there [19:54:10] now to compile it on ... 
* [19:54:27] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [19:54:29] did .22 just die xD [19:54:36] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:54:51] mutante: :D [19:54:53] (03CR) 10Andrew Bogott: [C: 032] Move toollabs node classes to roles. [puppet] - 10https://gerrit.wikimedia.org/r/314180 (https://phabricator.wikimedia.org/T147233) (owner: 10Andrew Bogott) [19:55:40] !log restarting restbase in codfw [19:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:56:01] (03PS2) 10Dzahn: mariadb: split role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) [19:56:05] (03CR) 10jenkins-bot: [V: 04-1] Move logging.pp roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315341 (owner: 10Yuvipanda) [19:57:14] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [19:58:43] (03PS3) 10Dzahn: mariadb: split role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) [19:59:01] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [19:59:01] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [19:59:05] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:59:12] (03PS1) 10Thcipriani: Group0 to 1.28.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315345 [19:59:14] (03PS2) 10Yuvipanda: Move logging.pp roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315341 [19:59:28] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [19:59:28] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [19:59:57] (03CR) 10Thcipriani: [C: 032] Group0 to 1.28.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315345 (owner: 10Thcipriani) [20:00:14] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [20:00:15] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [20:00:15] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [20:00:27] (03Merged) 10jenkins-bot: Group0 to 1.28.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315345 (owner: 10Thcipriani) [20:00:43] (03PS4) 10Dzahn: mariadb: split role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) [20:00:56] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [20:01:17] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [20:01:17] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [20:01:49] !log thcipriani@mira rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.28.0-wmf.22 [20:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:01:56] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1022 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:01:56] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is OK: 
PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:02:20] (03PS3) 10Yuvipanda: Move logging.pp roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315341 [20:02:38] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1020 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:03:15] !log T133395: Restarting Cassandra: restbase2008-c [20:03:16] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [20:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:03] GAH not sure what is happening w mirror maker [20:05:05] am troubleshooting [20:05:09] this cafe is clossing [20:05:11] back ina sec [20:05:52] (03PS4) 10Yuvipanda: Move logging.pp roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315341 [20:06:24] !log repooling mobileapps on scb1001 [20:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:35] Could I get some help from anyone from ops? could you please repool mobileapps on scb1001? 
It didn't get repooled during a scap deploy [20:08:39] 06Operations, 06Discovery, 06Discovery-Analysis, 07Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2707104 (10debt) [20:08:45] I don't have enough rights [20:08:46] 06Operations, 06Discovery, 06Discovery-Analysis (Current work), 07Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2700737 (10debt) [20:09:51] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:09:53] (03PS5) 10Yuvipanda: Move logging.pp roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315341 [20:09:58] (03CR) 10Yuvipanda: [C: 032] Move logging.pp roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315341 (owner: 10Yuvipanda) [20:10:01] (03CR) 10Yuvipanda: [V: 032] Move logging.pp roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315341 (owner: 10Yuvipanda) [20:10:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:12:28] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [20:12:28] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:13:23] 06Operations, 06Discovery, 06Discovery-Analysis (Current work), 07Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2707117 (10debt) p:05Triage>03Normal [20:15:14] 
(03PS1) 10Yuvipanda: Move dumps roles to role module [puppet] - 10https://gerrit.wikimedia.org/r/315350 [20:18:20] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:22:59] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:23:06] Is anyone from ops here? We need a little help here [20:23:09] 06Operations, 10Domains, 10Traffic: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707151 (10Zppix) [20:24:01] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/4307/" [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [20:24:18] (03PS2) 10Yuvipanda: Move dumps roles to role module [puppet] - 10https://gerrit.wikimedia.org/r/315350 [20:25:27] yuvitest: are you around? [20:25:35] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1022 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:25:35] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:25:46] Pchelolo: trying to fix this ^ [20:25:48] but what's up? [20:26:08] ottomata: awesome! you're here. Could you help us with your ops power and repool mobileapps on scb1001? [20:26:20] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1020 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:26:50] it didn't get repooled during the deploy with scap3 [20:27:05] looking...
[20:27:13] 06Operations, 10Domains, 10Traffic: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707138 (10Krenair) I think so yes. Why is this #Domains? [20:27:41] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:27:47] ottomata: oh, actually, maybe I'm asking for the wrong thing [20:28:13] (03PS3) 10Yuvipanda: Move dumps roles to role module [puppet] - 10https://gerrit.wikimedia.org/r/315350 [20:28:16] https://config-master.wikimedia.org/conftool/eqiad/mobileapps [20:28:18] (03CR) 10Yuvipanda: [C: 032 V: 032] "https://puppet-compiler.wmflabs.org/4309/ noop" [puppet] - 10https://gerrit.wikimedia.org/r/315350 (owner: 10Yuvipanda) [20:28:21] scb1001 is depooled [20:28:24] so i should repool it? [20:28:49] ottomata: I'm not 100% sure, but mobileapps is having trouble on scb1002 [20:29:30] ottomata: from what I'm gathering from Pchelolo, I'm assuming that he means to repool it [20:29:41] 06Operations, 10Domains, 10Traffic: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707157 (10jeremyb) see also {T88859}, cc @Slaporte [20:29:54] according to server admin logs: 12:39 moritzm: nodejs reverted to 4.4.6 on scb1001, depooling for service restarts [20:30:23] but then it was never repooled. Was that intentional? [20:30:23] 06Operations, 10Domains, 10Traffic: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707164 (10Zppix) @jeremyb its restricted [20:31:13] hm [20:31:18] yeah i guess only moritzm would know [20:31:22] i can repool if you like [20:31:38] say yes! [20:31:38] (03PS1) 10Yuvipanda: Move wdq_mm roles to role module [puppet] - 10https://gerrit.wikimedia.org/r/315351 [20:31:40] i will do [20:31:40] ottomata: CPU on scb1002 is almost at 100, so let's repool?
[20:31:42] ok [20:31:43] doig [20:31:44] doing [20:32:25] (03PS2) 10Yuvipanda: Move wdq_mm roles to role module [puppet] - 10https://gerrit.wikimedia.org/r/315351 [20:33:19] done Pchelolo [20:33:28] !log repooled scb1001 for mobileapps [20:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:33:41] thank you ottomata, I'll be monitoring the situation [20:33:45] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [20:34:02] (03CR) 10Yuvipanda: [C: 032] Move wdq_mm roles to role module [puppet] - 10https://gerrit.wikimedia.org/r/315351 (owner: 10Yuvipanda) [20:34:20] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:35:08] Pchelolo: ottomata : I just looked at other services (ores, mathoid), and their config is the same, looks like they are depooled as well (e.g. https://config-master.wikimedia.org/conftool/eqiad/mathoid) [20:35:18] urandom: ^ [20:35:52] Pchelolo: I can repool them all if you say I should [20:36:26] (03PS1) 10Yuvipanda: Move wikilabels roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315353 [20:37:00] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:37:03] ottomata: others don't experience troubles as far as I can tell, so let's leave it as is? maybe that was intentional [20:37:19] 06Operations, 10Domains, 10Traffic: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707138 (10Dzahn) see T44085 ? 
[20:37:45] (03PS2) 10Yuvipanda: Move wikilabels roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315353 [20:37:51] (03CR) 10Yuvipanda: [C: 032] Move wikilabels roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315353 (owner: 10Yuvipanda) [20:37:53] (03CR) 10Yuvipanda: [V: 032] Move wikilabels roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/315353 (owner: 10Yuvipanda) [20:38:29] Pchelolo: he probably depooled all of scb1001 to do the nodejs version change [20:38:31] and just forgot [20:38:54] 06Operations, 10Domains, 10Traffic: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707198 (10Zppix) [20:39:59] ottomata: ye, probably that's what happened [20:40:14] 06Operations, 10Domains, 10Traffic: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707214 (10Dzahn) https://w.wiki/ -> https://meta.wikimedia.org/wiki/Special:UrlShortener though "Creating new short URLs is temporarily disabled. " [20:40:48] (03PS2) 10Andrew Bogott: Rename role::labs::tools::* to role::toollabs::* [puppet] - 10https://gerrit.wikimedia.org/r/315302 [20:42:42] (03CR) 10Andrew Bogott: [C: 032] Rename role::labs::tools::* to role::toollabs::* [puppet] - 10https://gerrit.wikimedia.org/r/315302 (owner: 10Andrew Bogott) [20:43:08] (03PS1) 10Yuvipanda: Move restbase role into role module [puppet] - 10https://gerrit.wikimedia.org/r/315367 [20:43:12] (03PS3) 10Dzahn: gerrit: remove backup::host, rsyncd include from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/314767 (https://phabricator.wikimedia.org/T147597) [20:46:02] 06Operations, 10Domains, 10Traffic: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707233 (10jeremyb) >>! In T147887#2707164, @Zppix wrote: > @jeremyb its restricted yes, sorry, I hadn't noticed that. 
[20:46:09] http://www.btwifi.co.uk/ [20:46:11] Woops [20:46:16] Sorry wrong place [20:46:19] (03PS2) 10Yuvipanda: Move restbase role into role module [puppet] - 10https://gerrit.wikimedia.org/r/315367 [20:48:54] (03CR) 10Dzahn: [C: 032] gerrit: remove backup::host, rsyncd include from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/314767 (https://phabricator.wikimedia.org/T147597) (owner: 10Dzahn) [20:49:50] (03PS1) 10Ottomata: Set acks=1 for analytics kafka mirror instances [puppet] - 10https://gerrit.wikimedia.org/r/315400 [20:50:09] (03PS2) 10Dzahn: gerrit: mv standard incl to role, rm duplicate firewall [puppet] - 10https://gerrit.wikimedia.org/r/314768 (https://phabricator.wikimedia.org/T147597) [20:50:15] (03PS3) 10Dzahn: gerrit: mv standard incl to role, rm duplicate firewall [puppet] - 10https://gerrit.wikimedia.org/r/314768 (https://phabricator.wikimedia.org/T147597) [20:50:28] (03PS3) 10Yuvipanda: Move restbase role into role module [puppet] - 10https://gerrit.wikimedia.org/r/315367 [20:51:05] (03CR) 10Yuvipanda: [C: 032 V: 032] Move restbase role into role module [puppet] - 10https://gerrit.wikimedia.org/r/315367 (owner: 10Yuvipanda) [20:51:22] (03CR) 10Ottomata: [C: 032] Set acks=1 for analytics kafka mirror instances [puppet] - 10https://gerrit.wikimedia.org/r/315400 (owner: 10Ottomata) [20:51:28] (03PS2) 10Ottomata: Set acks=1 for analytics kafka mirror instances [puppet] - 10https://gerrit.wikimedia.org/r/315400 [20:51:32] (03CR) 10Ottomata: [V: 032] Set acks=1 for analytics kafka mirror instances [puppet] - 10https://gerrit.wikimedia.org/r/315400 (owner: 10Ottomata) [20:54:56] host mw1307 is spamming fatalmonitor [20:54:58] (03CR) 10Bmansurov: [C: 031] Disable bottom language button in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315314 (https://phabricator.wikimedia.org/T143829) (owner: 10Jdlrobson) [20:55:05] far and away the loudest host [20:55:06] (03PS1) 10Andrew Bogott: Qualify a bunch of references to ::toollabs classes 
[puppet] - 10https://gerrit.wikimedia.org/r/315412 [20:55:47] (03PS1) 10Rush: maintain-views: externalize dependencies [puppet] - 10https://gerrit.wikimedia.org/r/315413 [20:55:49] greg-g mw1307 must want some fatalites [20:56:52] (03CR) 10Andrew Bogott: [C: 032] Qualify a bunch of references to ::toollabs classes [puppet] - 10https://gerrit.wikimedia.org/r/315412 (owner: 10Andrew Bogott) [20:57:11] nvm, it was just at one moment, a ton of Cannot access the database: Unknown error (10.64.16.102) [20:57:16] (03CR) 10jenkins-bot: [V: 04-1] maintain-views: externalize dependencies [puppet] - 10https://gerrit.wikimedia.org/r/315413 (owner: 10Rush) [20:57:21] (03PS2) 10Rush: maintain-views: externalize dependencies [puppet] - 10https://gerrit.wikimedia.org/r/315413 [20:57:26] all commonswiki [20:57:58] commons is causing issues... about time xD [20:58:22] (03CR) 10Dzahn: "can't compile, cobalt isnt know by compiler" [puppet] - 10https://gerrit.wikimedia.org/r/314768 (https://phabricator.wikimedia.org/T147597) (owner: 10Dzahn) [20:58:42] !log T133395: Restart Cassandra on restbase2005-a.codfw.wmnet [20:58:43] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [20:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:52] (03PS4) 10Dzahn: gerrit: mv standard incl to role, rm duplicate firewall [puppet] - 10https://gerrit.wikimedia.org/r/314768 (https://phabricator.wikimedia.org/T147597) [20:58:59] (03CR) 10jenkins-bot: [V: 04-1] maintain-views: externalize dependencies [puppet] - 10https://gerrit.wikimedia.org/r/315413 (owner: 10Rush) [21:00:52] 06Operations, 10OTRS: Upgrade OTRS to 5.0.13 - https://phabricator.wikimedia.org/T147397#2707289 (10akosiaris) 05Open>03Resolved I 'll tentatively re-resolve this. 
Feel free to reopen [21:01:06] 06Operations: dubnium disk full - https://phabricator.wikimedia.org/T147173#2707299 (10akosiaris) 05Open>03Resolved [21:01:25] Zppix: comments like that are offtopic and unhelpful in this channel [21:01:38] (03CR) 10Dzahn: [C: 032] gerrit: mv standard incl to role, rm duplicate firewall [puppet] - 10https://gerrit.wikimedia.org/r/314768 (https://phabricator.wikimedia.org/T147597) (owner: 10Dzahn) [21:03:34] (03PS3) 10Rush: maintain-views: externalize dependencies [puppet] - 10https://gerrit.wikimedia.org/r/315413 [21:06:20] !log T133395: Restart Cassandra on restbase2005-b.codfw.wmnet [21:06:21] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [21:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:11] (03PS4) 10Rush: maintain-views: externalize dependencies [puppet] - 10https://gerrit.wikimedia.org/r/315413 [21:08:05] ottomata, Pchelolo: indeed, thanks for fixing [21:08:34] morebots: i only repooled mobileapps [21:08:34] I am a logbot running on tools-exec-1219. [21:08:34] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [21:08:34] To log a message, type !log <msg>. [21:08:45] (03CR) 10Rush: [C: 032] maintain-views: externalize dependencies [puppet] - 10https://gerrit.wikimedia.org/r/315413 (owner: 10Rush) [21:08:47] should we repool all services on scb1001?
[21:08:53] sorry moritzm^^ [21:10:19] !log T133395: Restart Cassandra on restbase2005-c.codfw.wmnet [21:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:11] ottomata: yeah, I'll do that [21:13:48] we had a bit of a back and forth until it was confirmed that we need to revert to 4.4.6 [21:14:18] (03PS1) 10Dzahn: gerrit: remove lead from site.pp, adjust comment [puppet] - 10https://gerrit.wikimedia.org/r/315418 (https://phabricator.wikimedia.org/T147597) [21:14:56] !log repooling all services on scb1001 after earlier revert to nodejs 4.4.6 [21:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:13] !log T133395: Restarting Cassandra instances on restbase2006.codfw.wmnet [21:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:17:02] (03PS1) 10Andrew Bogott: Qualify more classes to distinguish them from roles [puppet] - 10https://gerrit.wikimedia.org/r/315419 [21:17:09] thanks [21:18:13] (03CR) 10Andrew Bogott: [C: 032] Qualify more classes to distinguish them from roles [puppet] - 10https://gerrit.wikimedia.org/r/315419 (owner: 10Andrew Bogott) [21:18:45] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [21:20:12] (03PS1) 10Rush: labsdb: maintain-views mediawiki-config checkout [puppet] - 10https://gerrit.wikimedia.org/r/315421 [21:21:32] (03PS2) 10Rush: labsdb: maintain-views mediawiki-config checkout [puppet] - 10https://gerrit.wikimedia.org/r/315421 [21:24:11] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:24:17] (03CR) 10Rush: [C: 032] labsdb: maintain-views mediawiki-config checkout [puppet] - 10https://gerrit.wikimedia.org/r/315421 (owner: 10Rush) [21:26:52] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [21:29:25] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:31] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:54] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [21:32:11] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [21:33:32] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:34:54] urandom: ^ [21:35:09] yuvitest: yeah, sorry, looking at it [21:36:14] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [21:38:45] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2707523 (10RobH) I would think that using @emailbot is non-ideal. I use it daily for my work in S4 with vendors, and just getting it to work there is hit and miss. It needs quite a bit more tweaks for the limited sub-set... [21:51:07] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2707542 (10yuvipanda) Since it's just a matter of packages, I am pretty sure we can use the same puppet code. We'll just import the up... [22:31:33] (03CR) 10EBernhardson: "do we still need this for T147495 (zh/ja/th test)?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315250 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [22:35:26] (03CR) 10EBernhardson: [C: 031] "Looks good. verified similarity profile change will only effect index building." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/315297 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [22:36:45] (03PS2) 10Dzahn: gerrit: remove lead from site.pp, adjust comment [puppet] - 10https://gerrit.wikimedia.org/r/315418 (https://phabricator.wikimedia.org/T147597) [22:37:57] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2707669 (10GWicke) @brion, this task looks pretty stalled by now. Do you see a chance for reviving it any t... [22:50:06] (03PS1) 10EBernhardson: Set defaults for wgCirrusSearchClusterOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315443 [22:52:08] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161011T2300). [23:00:05] Jdlrobson, Krinkle, tgr, Krenair, and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:11] \o present [23:00:17] \o [23:00:41] i suppose i can start shipping things [23:01:13] * ebernhardson hopes the slowness of gerrit responding to me opening a few tabs isn't a bad sign... [23:01:58] ostriches: is gerrit unhappy today? 
i just got a cannot be reached [23:02:32] ebernhardson: news to me if it is :) [23:02:35] we appear to have 9 patches in an 8 patch window [23:04:15] ostriches: i've gotten multiple timeouts or server unavailables now opening up tabs for the patches in SWAT :S [23:04:50] yeah, i can't edit a patch anymore [23:04:51] (03PS2) 10EBernhardson: Disable bottom language button in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315314 (https://phabricator.wikimedia.org/T143829) (owner: 10Jdlrobson) [23:04:57] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:05:04] ebernhardson: has been doing that occasionally since morning [23:05:29] WFM [23:05:56] ebernhardson: hey could you do the non-config change first? [23:06:19] jdlrobson: sure [23:06:21] and whoops i just saw i didnt check it out to the right branches :-S [23:06:35] yea i just cherry picked it to wmf.22 [23:06:42] assuming that's the right one? [23:06:43] (03CR) 10Dzahn: [C: 032] gerrit: remove lead from site.pp, adjust comment [puppet] - 10https://gerrit.wikimedia.org/r/315418 (https://phabricator.wikimedia.org/T147597) (owner: 10Dzahn) [23:06:51] thanks ebernhardson - i think we need both though [23:07:20] yep wmf.21 too [23:08:19] ebernhardson: shall i cherry pick to wmf.21? [23:08:29] Krinkle: ok for robots.php deploy? 
[23:09:26] (03CR) 10EBernhardson: [C: 032] Set defaults for wgCirrusSearchClusterOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315443 (owner: 10EBernhardson) [23:09:53] (03Merged) 10jenkins-bot: Set defaults for wgCirrusSearchClusterOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315443 (owner: 10EBernhardson) [23:09:57] 06Operations, 10Gerrit, 13Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2707712 (10Dzahn) [23:10:06] 06Operations, 10Gerrit, 10hardware-requests: Allocate spare misc box in eqiad for gerrit replacement - https://phabricator.wikimedia.org/T147596#2707714 (10Dzahn) [23:10:08] 06Operations, 10Gerrit, 13Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2697753 (10Dzahn) 05Open>03Resolved [23:11:36] !log lead - revoke puppet cert, node clean [23:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:16] !log pulled config change to m21099 [23:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:22] * ebernhardson can't type aparently... [23:12:32] we got a whole bunch more servers, apparently :) [23:13:26] ebernhardson: Yes [23:15:13] !log ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: Set defaults for wgCirrusSearchClusterOverrides (duration: 00m 53s) [23:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:02] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:16:26] ebernhardson: I'd like to verify in beta post-merge or on mw1099 before going everywhere. 
[23:16:49] Krinkle: certainly, jdlrobson's patch just merged so i'm going to ship that, then you're next [23:17:01] !log ebernhardson@mira Synchronized wmf-config/CirrusSearch-common.php: Set defaults for wgCirrusSearchClusterOverrides (duration: 00m 56s) [23:17:03] ebernhardson: time to test? [23:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:16] jdlrobson: not yet, it just synced out my nop config change while i was waiting for your merge. soon :) [23:17:40] 06Operations, 06Performance-Team, 10scap, 07Epic: During deployment old servers may populate new cache URIs - https://phabricator.wikimedia.org/T47877#2707719 (10Krinkle) [23:18:38] !log pulled MobileFront update to mw1099 [23:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:45] jdlrobson: ok wmf.21 and wmf.22 changes are on mw1099 [23:19:54] ebernhardson: verified everything is ok on 1.28.0-wmf.21 [23:20:17] also good on 1.28.0-wmf.22 [23:20:20] sync away! 
[23:21:38] (03CR) 10EBernhardson: [C: 032] Disable bottom language button in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315314 (https://phabricator.wikimedia.org/T143829) (owner: 10Jdlrobson) [23:21:58] !log ebernhardson@mira Synchronized php-1.28.0-wmf.22/extensions/MobileFrontend/includes/skins/SkinMinerva.php: SWAT: Fix logic of MinervaBottomLanguageButton T143829 (duration: 00m 50s) [23:22:01] T143829: Remove unnecessary "Read in another language" button except for on Main pages - https://phabricator.wikimedia.org/T143829 [23:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:12] !log ebernhardson@mira Synchronized php-1.28.0-wmf.21/extensions/MobileFrontend/includes/skins/SkinMinerva.php: SWAT: Fix logic of MinervaBottomLanguageButton T143829 (duration: 00m 50s) [23:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:18] jdlrobson: all shipped [23:23:23] (03PS3) 10EBernhardson: Disable bottom language button in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315314 (https://phabricator.wikimedia.org/T143829) (owner: 10Jdlrobson) [23:23:27] ebernhardson: config change is live? 
[23:23:39] (03CR) 10EBernhardson: [C: 032] Disable bottom language button in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315314 (https://phabricator.wikimedia.org/T143829) (owner: 10Jdlrobson) [23:23:50] jdlrobson: no, the others [23:23:56] phew :) [23:23:59] config change is next, gerrit hasn't merged it yet [23:24:05] (03Merged) 10jenkins-bot: Disable bottom language button in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315314 (https://phabricator.wikimedia.org/T143829) (owner: 10Jdlrobson) [23:24:41] !log pulled config change (315314) to mw1099 [23:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:47] jdlrobson: config change is now on mw1099 [23:25:10] (03PS2) 10EBernhardson: robots.php: Use WikiPage instead of Article class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314790 (owner: 10Krinkle) [23:25:54] ebernhardson: whoops [23:25:56] you can slap me [23:26:01] * ebernhardson slaps jdlrobson around a bit with a large trout [23:26:05] it's not working and i've just noticed a stupid mistake in the patch [23:26:08] it's missing the wg prefix [23:26:17] slap deserved [23:26:36] push a fix? 
[23:27:08] or revert and deal with it later [23:27:16] (03PS1) 10Jdlrobson: Always remember your wg prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315450 [23:27:17] ^ ebernhardson [23:27:32] sorry [23:27:35] (03CR) 10EBernhardson: [C: 032] Always remember your wg prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315450 (owner: 10Jdlrobson) [23:28:02] (03Merged) 10jenkins-bot: Always remember your wg prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315450 (owner: 10Jdlrobson) [23:28:49] !log pulled config change (315314) to mw1099 [23:28:58] !log pulled config change (315450) to mw1099 [23:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:04] jdlrobson: ok try again [23:29:21] ebernhardson: great [23:29:30] merge away :) [23:30:03] (03PS3) 10EBernhardson: robots.php: Use WikiPage instead of Article class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314790 (owner: 10Krinkle) [23:30:13] (03CR) 10EBernhardson: [C: 032] robots.php: Use WikiPage instead of Article class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314790 (owner: 10Krinkle) [23:30:35] !log ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: SWAT T143829 Disable bottom language button in Minerva (duration: 00m 50s) [23:30:37] T143829: Remove unnecessary "Read in another language" button except for on Main pages - https://phabricator.wikimedia.org/T143829 [23:30:42] (03Merged) 10jenkins-bot: robots.php: Use WikiPage instead of Article class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314790 (owner: 10Krinkle) [23:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:31:09] !log pulled config change (314790) to mw1099 [23:31:13] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:31:18] 
Krinkle: your robots.php patch is on mw1099 now [23:31:36] tgr: you're up next with centralauth in wmf.21 [23:32:29] ebernhardson: All good. verified. [23:34:00] ebernhardson: it's a maintenance script, you can skip mw1099 [23:34:05] tgr: ok [23:34:10] thanks ebernhardson [23:34:10] !log ebernhardson@mira Synchronized w/robots.php: SWAT robots.php: Use WikiPage instead of Article class (duration: 00m 50s) [23:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:44] Krinkle: shipped out everywhere [23:35:28] James_F: around for ve tabs swat patch in wmf.22? [23:35:44] thought that was listed under my name, ebernhardson [23:35:58] Krenair: oh it might be, i was looking at the patch itself. ok you're up next [23:36:13] waiting on the centralauth patch to merge atm [23:36:25] Yes. [23:36:33] Ed wrote the patch, I approved it and James backported it. Then I agreed to get it through swat [23:36:46] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2707752 (10brion) @GWicke Yes, I think we should revive it as part of a concerted effort between Editing/Pa... 
[23:40:50] tgr: you're syncing out now [23:41:46] !log ebernhardson@mira Synchronized php-1.28.0-wmf.21/extensions/CentralAuth/: SWAT T147029 Add ignorestatus option for fixing stuck renames (duration: 00m 53s) [23:41:47] T147029: Global rename Gautehuus → Neuraxıs is stuck on Commons - https://phabricator.wikimedia.org/T147029 [23:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:53] tgr: all deployed [23:44:26] ebernhardson: thanks, verified [23:48:46] !log pulled ve update (315424) to mw1099 [23:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:48:55] Krenair: please test [23:50:50] ebernhardson, yep that fixes it [23:50:58] please sync [23:52:24] !log ebernhardson@mira Synchronized php-1.28.0-wmf.22/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js: SWAT T147890 Only enable VE tabs if VE is available (duration: 00m 50s) [23:52:25] T147890: VE in non-NWE mode loads on "Edit source" pages on wmf.22 like the Template and MediaWiki namespaces - https://phabricator.wikimedia.org/T147890 [23:52:28] Krenair: all synced out [23:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:53:28] James_F, ^ [23:53:50] ebernhardson, thanks! [23:54:26] Thanks! [23:54:39] np [23:54:46] !log ebernhardson@mira Synchronized php-1.28.0-wmf.22/includes/ForkController.php: SWAT T147881 Call destroy method that actually exists instead of one that doesnt anymore. (duration: 00m 52s) [23:54:49] T147881: undefined method LBFactoryMulti::destroyInstance() when running extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php - https://phabricator.wikimedia.org/T147881 [23:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:56:27] (to self i guess :P) verified forkcontroller now works as expected on testwiki [23:57:50] (03CR) 10Chad: [C: 04-1] "I'd prefer not. 
The reason that other list exists is because there wasn't (previously) a way to list all the extensions. You can do that n" [puppet] - 10https://gerrit.wikimedia.org/r/315301 (owner: 10Paladox)