[00:30:54] 10Operations, 10Labs, 10Labs-Infrastructure, 10labs-sprint-117, 10labs-sprint-118: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#3317065 (10chasemp) This task was to make a plan for user mgmt access to bare metal as a service @dzahn to help clarify, which we have... [00:43:38] (03CR) 10Kaldari: [C: 031] "Looks good to me. Let's schedule for later this week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357317 (https://phabricator.wikimedia.org/T165007) (owner: 10MaxSem) [01:36:24] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team, 10Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3317112 (10tstarling) [01:39:39] 10Operations, 10Labs, 10Labs-Infrastructure, 10labs-sprint-117, 10labs-sprint-118: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#3317114 (10Dzahn) Got it, thank you both. Yep! [02:21:34] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 32s) [02:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:38] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jun 6 02:27:37 UTC 2017 (duration 6m 3s) [02:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:24:16] 10Operations, 10Wikimedia-General-or-Unknown, 10I18n: wikimediafoundation.org's language selector is confusing to most visitors who don't have accounts there - https://phabricator.wikimedia.org/T166782#3317206 (10whym) [03:38:36] PROBLEM - Disk space on ms-be1016 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error [03:39:06] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 62220 MB (12% inode=99%) [03:41:36] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[mountpoint-/srv/swift-storage/sdh1] [03:54:06] RECOVERY - Disk space on elastic1019 is OK: DISK OK [04:01:36] RECOVERY - Disk space on ms-be1016 is OK: DISK OK [04:08:36] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [05:08:36] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. [05:41:36] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [05:45:04] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317260 (10Marostegui) This went back to faulty again: ``` BatteryType: BBU Battery State: Unknown Battery backup charge time : 0 hours ``` Raid went back to WriteThrough: ``` Default Cache... [05:51:36] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [05:54:03] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317261 (10Marostegui) And it is back: ``` 05:51 < icinga-wm> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy Default Cache Policy: WriteBack, Read... 
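The db1016 MegaRAID alert above fires when a logical drive's cache policy falls back from WriteBack to WriteThrough, which is what the controller does while its BBU is faulty or recharging (as in the quoted T166344 comments). A minimal Python sketch of that kind of check, assuming MegaCli-style output with a "Current Cache Policy:" line; this is not the production MegaRAID plugin:

```python
#!/usr/bin/env python3
"""Minimal sketch, NOT the production MegaRAID check: flag logical drives
whose current cache policy has fallen back from WriteBack to WriteThrough
(the usual symptom of a faulty/charging BBU, as on db1016 above)."""
import subprocess
import sys


def check_cache_policy():
    # Assumption: MegaCli64 is installed and its -LDInfo output contains
    # lines like "Current Cache Policy: WriteThrough, ReadAheadNone, ...".
    out = subprocess.check_output(
        ['MegaCli64', '-LDInfo', '-Lall', '-aALL', '-NoLog'], text=True)
    bad = [line for line in out.splitlines()
           if line.strip().startswith('Current Cache Policy:')
           and 'WriteThrough' in line]
    if bad:
        print('CRITICAL: %d LD(s) must have write cache policy WriteBack, '
              'currently using: WriteThrough' % len(bad))
        return 2
    print('OK: all logical drives use WriteBack policy')
    return 0


if __name__ == '__main__':
    sys.exit(check_cache_policy())
```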
[05:56:11] !log Deploy alter table s3 on db1075 (eqiad master) - T166278 [05:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:21] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [06:13:00] (03PS8) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [06:33:44] (03PS1) 10Marostegui: db-eqiad.php: Add comment about db1089 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357341 (https://phabricator.wikimedia.org/T166935) [06:36:26] PROBLEM - Check systemd state on elastic2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:37:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add comment about db1089 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357341 (https://phabricator.wikimedia.org/T166935) (owner: 10Marostegui) [06:38:32] (03Merged) 10jenkins-bot: db-eqiad.php: Add comment about db1089 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357341 (https://phabricator.wikimedia.org/T166935) (owner: 10Marostegui) [06:38:41] (03CR) 10jenkins-bot: db-eqiad.php: Add comment about db1089 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357341 (https://phabricator.wikimedia.org/T166935) (owner: 10Marostegui) [06:40:06] 10Operations, 10ops-codfw: mw2221 stuck after reboot - https://phabricator.wikimedia.org/T165734#3317286 (10MoritzMuehlenhoff) 05Open>03Resolved Closing, that host is running and repooled for a while now. [06:40:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add comments about current status of db1089 - T166935 (duration: 00m 39s) [06:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:18] T166935: db1089: update RAID controller firwmare - https://phabricator.wikimedia.org/T166935 [06:49:27] PROBLEM - Check the NTP synchronisation status of timesyncd on elastic2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:11] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317309 (10elukey) >>! In T166141#3315357, @jcrespo wrote: > Not really, we have almost decided the goals for Q1, and they are all quite urgent and for hardware that ha... [07:10:48] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317327 (10Marostegui) >>! In T166141#3317309, @elukey wrote: >>>! In T166141#3315357, @jcrespo wrote: >> Not really, we have almost decided the goals for Q1, and they... [07:11:36] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [07:12:47] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317328 (10Marostegui) And again: ``` ˜/icinga-wm 9:11> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough ``` [07:27:49] moritzm: Guten Tag. 
Looks like hhvm-dbg wmf4+exp1 is not available on apt.wikimedia.org :D [07:28:09] that breaks puppet / apt-get install on the beta cluster instances which have hhvm-dbg installed: [07:28:10] hhvm-dbg : Depends: hhvm (= 3.18.2+dfsg-1+wmf4) but 3.18.2+dfsg-1+wmf4+exp1 is installed [07:29:46] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:47] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:47] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:30:46] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [07:30:46] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:30:46] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [07:32:14] hashar: which host is that? wmf+exp1 is intentionally not on apt.wikimedia.org, it was an experimental build to investigate the behaviour of persistant connections as used by the job runners [07:32:32] moritzm: deployment-jobrunner02.deployment-prep.eqiad.wmflabs [07:32:48] those tests are completed, so if it's still around, I'll simply downgrade to +wmf4 [07:33:00] ahh make sense. thanks [07:35:39] fixed [07:38:22] hashar_: my fault! I was testing a new hhvm version for the connect timeouts! [07:38:36] no worries :-} [07:38:47] has the experience been any helpful ? [07:39:56] PROBLEM - IPMI Temperature on labsdb1003 is CRITICAL: Sensor Type(s) Temperature Status: Critical [Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical, Processo [07:41:26] PROBLEM - puppet last run on elastic2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:43:23] (03PS1) 10Hashar: beta: profile::cassandra::allow_analytics: false [puppet] - 10https://gerrit.wikimedia.org/r/357344 [07:45:30] (03PS2) 10Hashar: beta: profile::cassandra::allow_analytics: false [puppet] - 10https://gerrit.wikimedia.org/r/357344 [07:46:46] (03CR) 10Hashar: "puppet fails on deployment-prep instances since the hiera "role" hierarchy is not looked up. 
https://gerrit.wikimedia.org/r/#/c/357344/ sh" [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [07:49:04] (03PS1) 10DCausse: Test that replica counts are within sane bounds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357345 [07:49:26] RECOVERY - Check systemd state on elastic2014 is OK: OK - running: The system is fully operational [07:50:31] (03CR) 10DCausse: [C: 031] "lgtm," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 (owner: 10Hashar) [07:53:00] !log starting upgrade to elasticsearch 5.3.2 on cirrus eqiad cluster - T163708 [07:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:09] T163708: Upgrade the production search cluster to elastic 5.3.2 - https://phabricator.wikimedia.org/T163708 [07:55:05] (03CR) 10Elukey: "Tested the script (with echo instead of restart of course) on rdb1001 and rdb1002, everything works as expected (only the latter prints re" [puppet] - 10https://gerrit.wikimedia.org/r/357193 (owner: 10Giuseppe Lavagetto) [07:56:54] hashar: re "has the exp been helpful" - not really, I didn't manage to fix the issue depicted in https://github.com/facebook/hhvm/issues/7854 but I have a better idea about what is happening. I feel that I am missing something trivial [07:57:24] (03Abandoned) 10DCausse: [wikitech] Increase weight on Tool and Nova Resource ns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354474 (https://phabricator.wikimedia.org/T165725) (owner: 10DCausse) [08:04:04] (03CR) 10Alexandros Kosiaris: [C: 032] Do not confine LLDP fact to physical/non-VMs [puppet] - 10https://gerrit.wikimedia.org/r/354108 (owner: 10Faidon Liambotis) [08:04:09] (03PS4) 10Alexandros Kosiaris: Do not confine LLDP fact to physical/non-VMs [puppet] - 10https://gerrit.wikimedia.org/r/354108 (owner: 10Faidon Liambotis) [08:04:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Do not confine LLDP fact to physical/non-VMs [puppet] - 10https://gerrit.wikimedia.org/r/354108 (owner: 10Faidon Liambotis) [08:07:14] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Elasticsearch errors about BulkShardRequest - https://phabricator.wikimedia.org/T167091#3317363 (10Gehel) [08:07:42] (03PS2) 10Volans: Tox: find and check Python files without extension [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) [08:14:27] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317379 (10elukey) Sure I am concerned too, this is why I asked if it was possible to order the hardware as soon as possible to be ready to work on it by the end of Q1 :) [08:14:51] (03PS3) 10Elukey: Correct pageview_hourly loading scheme on pivot home [puppet] - 10https://gerrit.wikimedia.org/r/357315 (owner: 10Nuria) [08:17:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:18:06] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:18:36] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [08:19:26] PROBLEM - Check the NTP synchronisation status of timesyncd on elastic2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
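The ripe-atlas-codfw check above alerts once more than 19 of the measurement's 283 probes fail to reach codfw (and recovers at 18 failed). A minimal Python sketch of that thresholding, assuming the per-probe results have already been fetched from the RIPE Atlas measurement linked in the alert; not the production plugin:

```python
#!/usr/bin/env python3
"""Minimal sketch, not the production RIPE Atlas plugin: alert when the
number of failing probes exceeds a threshold, as in the
"failed 20 probes of 283 (alerts on 19)" message above."""


def evaluate(probe_results, alert_on=19):
    # probe_results: assumed list of booleans, True = probe reached codfw.
    failed = sum(1 for ok in probe_results if not ok)
    status = 'CRITICAL' if failed > alert_on else 'OK'
    return '%s - failed %d probes of %d (alerts on %d)' % (
        status, failed, len(probe_results), alert_on)


if __name__ == '__main__':
    # toy data: 283 probes, 20 of them failing
    print(evaluate([False] * 20 + [True] * 263))
```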
[08:19:28] (03CR) 10Elukey: [C: 032] Correct pageview_hourly loading scheme on pivot home [puppet] - 10https://gerrit.wikimedia.org/r/357315 (owner: 10Nuria) [08:20:45] (03PS2) 10Alexandros Kosiaris: network: Add kubernetes pod/service IPs [puppet] - 10https://gerrit.wikimedia.org/r/341792 [08:21:26] PROBLEM - Check systemd state on elastic2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:22:12] 10Operations, 10Continuous-Integration-Config, 10Operations-Software-Development, 10Patch-For-Review: Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169#2590514 (10Joe) >>! In T144169#2836235, @fgiunchedi wrote: > After some discussion in https://gerrit.wik... [08:22:54] (03CR) 10Alexandros Kosiaris: [C: 032] network: Add kubernetes pod/service IPs [puppet] - 10https://gerrit.wikimedia.org/r/341792 (owner: 10Alexandros Kosiaris) [08:22:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:22:58] (03PS3) 10Alexandros Kosiaris: network: Add kubernetes pod/service IPs [puppet] - 10https://gerrit.wikimedia.org/r/341792 [08:23:03] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] network: Add kubernetes pod/service IPs [puppet] - 10https://gerrit.wikimedia.org/r/341792 (owner: 10Alexandros Kosiaris) [08:24:35] gehel: /var/log/elasticsearch/production-search-codfw.log depleted the root partiton on elastic2014 [08:25:03] moritzm: yep, I'm on it with dcausse. We are trying to understand what went wrong before truncating the logs... [08:25:24] ok [08:30:47] (03PS3) 10Volans: Tox: find and check Python files without extension [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) [08:33:12] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Elasticsearch errors about BulkShardRequest - https://phabricator.wikimedia.org/T167091#3317408 (10dcausse) The doc `1615532` is in the general index at [[https://ja.wikipedia.org/wiki/%E3%83%8E%E3%83%BC%E3%83%88:%E9%80%9F%E6%B0%B4%E5%A4%AA%E9... 
[08:34:26] RECOVERY - Check systemd state on elastic2014 is OK: OK - running: The system is fully operational [08:37:27] (03PS31) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) [08:39:22] !log raise log level to WARN for TransportShardBulkAction on elasticsearch cirrus - T167091 [08:39:26] RECOVERY - puppet last run on elastic2014 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [08:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:30] T167091: Elasticsearch errors about BulkShardRequest - https://phabricator.wikimedia.org/T167091 [08:39:43] (03PS32) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) [08:43:03] !log stopping db2035 and preparing for reimage [08:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:23] (03CR) 10Muehlenhoff: Add initial class for ferm rules shared by all labstore hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [08:48:53] (03CR) 10Elukey: "Added another round of refactoring to eliminate the old zookeeper_cluster_name global variable from the profile zookeeper server." [puppet] - 10https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [08:50:49] (03PS4) 10Muehlenhoff: Add initial class for ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) [08:51:44] (03PS5) 10Muehlenhoff: Add initial class for ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) [08:54:52] (03CR) 10Giuseppe Lavagetto: "I would rename the file to a profile, as it's more of a profile (that could be included in all labstore roles, if there is more than one)." [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [08:54:59] !log restarting elastic2014 to reclaim free space on deleted log file [08:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:56] RECOVERY - Disk space on elastic2014 is OK: DISK OK [09:00:05] akosiaris and hashar: Respected human, time to deploy Jobrunner service to scap3 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T0900). Please do the needful. [09:00:21] hashar: ok, I am starting the dance :-) [09:00:30] sure [09:00:38] and as far as I can tell, scap only supports a single service :/ [09:02:07] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317454 (10Marostegui) db1075 the master is done - the whole shard is completed. [09:02:24] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317455 (10Marostegui) ^ Wrong ticket [09:02:34] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3317456 (10mmodell) [09:03:14] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3215445 (10mmodell) @robh: Thanks, this is on my radar. 
The current plan is to switch production to phab2001.codfw temporarily, then switch back from there to phab1... [09:03:36] (03PS7) 10Alexandros Kosiaris: Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani) [09:04:17] !log disable puppet on all jobrunners [09:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:34] actually let's do this correctly [09:04:39] !log disable puppet on all jobrunners T129148 [09:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:47] T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner) - https://phabricator.wikimedia.org/T129148 [09:05:28] (03CR) 10Alexandros Kosiaris: [C: 032] Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani) [09:05:46] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani) [09:08:00] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3317477 (10mmodell) [09:08:17] running puppet on tin T129148 [09:08:23] !log running puppet on tin T129148 [09:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:52] !log git pull and scap deploy --init for jobrunner T129148 [09:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:59] T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner) - https://phabricator.wikimedia.org/T129148 [09:12:58] !log running puppet on mw1161 T129148 [09:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:46] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [09:15:00] on tin files in /srv/deployment/jobrunner/jobrunner are still owned by trebuchet user :/ [09:15:03] XioNoX: ^^^ [09:15:26] hashar: yeah I think I 'll move the repo and have puppet recreate it [09:15:42] instead of messing with the ownerships manually [09:15:48] ;-D [09:15:52] volans: yeah, that's the zayo circuit we're getting emails about, zayo is working on it, fiber cut [09:16:12] yeah I was wondering why it alarmed again [09:16:13] on mw1161 at least it's owned by mwdeploy [09:16:19] or is recovering? :D [09:17:07] if the message changes it actually alarms again ;) [09:17:35] does it ? I don't remember so [09:17:43] it is not true [09:17:51] slave lag changes every time [09:18:02] and it only alarms once [09:18:13] you may be confused with passive alerts [09:18:16] maybe we have multiple notifications for that ? [09:18:18] I might remember wrong then... mumble mumble [09:18:32] right jynus only for passive, my bad [09:18:32] or special cases like CRIT -> WARN -> CRIT [09:18:39] anyway back to my migration, will look into it later if you guys haven't figured it out by then [09:19:17] RECOVERY - Check the NTP synchronisation status of timesyncd on elastic2014 is OK: OK: synced at Tue 2017-06-06 09:19:15 UTC. 
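On the question above about whether a changed check message re-alerts: for active Icinga checks, notifications follow state transitions rather than output changes, with passive checks and CRIT -> WARN -> CRIT flaps as the exceptions mentioned. A minimal Python sketch of that distinction (not Icinga's actual notification code):

```python
#!/usr/bin/env python3
"""Minimal sketch, not Icinga's real logic: a notification is sent only
when the check state changes, even if the plugin output differs between
runs (e.g. a slave lag value that changes every check)."""


def notifications(check_history):
    last_state = None
    sent = []
    for state, output in check_history:
        if state != last_state:
            sent.append('%s: %s' % (state, output))
        last_state = state
    return sent


if __name__ == '__main__':
    history = [
        ('CRITICAL', 'slave lag 120s'),
        ('CRITICAL', 'slave lag 180s'),   # output changed, state did not
        ('WARNING', 'slave lag 45s'),
        ('CRITICAL', 'slave lag 300s'),   # CRIT -> WARN -> CRIT re-alerts
    ]
    for line in notifications(history):
        print(line)
```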
[09:19:42] !log running puppet again on tin, after moving /serv/deployment/jobrunner/jobrunner T129148 [09:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:51] T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner) - https://phabricator.wikimedia.org/T129148 [09:19:54] :-} [09:20:35] still owned by trebuchet do [09:21:26] hashar: actually, it looks like this is correct [09:21:35] all repos are owned by trebuchet [09:21:38] so I made you move the repo for nothing? [09:21:54] no worries, it probably fixed some permissions anyway [09:22:18] things like adding +s on g [09:22:43] or something anyway. I am not gonna cargo cult on this one [09:22:49] sure [09:23:28] so I guess lets try on canaries hosts? They are mw1299.eqiad.wmnet mw2247.codfw.wmnet [09:23:41] no, not yet [09:23:50] I first have to run puppet over there [09:24:04] but it's good to run across all of them judging from mw1161 [09:25:33] !log moving around jobrunner/jobrunner was probably not required T129148 [09:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:42] T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner) - https://phabricator.wikimedia.org/T129148 [09:25:43] !log running puppet on videoscalers T129148 [09:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:00] !log running puppet on jobrunners T129148 [09:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:19] hashar: ok, I am running the first scap deploy -v [09:31:24] \O/ [09:31:30] !log akosiaris@tin Started deploy [jobrunner/jobrunner@161c84c]: (no justification provided) [09:31:37] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [09:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:48] canary deploy successful. Continue? [y]es/[n]o/[c]ontinue all groups: [09:31:50] :-) [09:32:22] so on mw1299 the jobrunner service got restarted [09:32:32] and jobchron is left behind. Will have to figure out a solution for that later on [09:32:48] !log akosiaris@tin Finished deploy [jobrunner/jobrunner@161c84c]: (no justification provided) (duration: 01m 17s) [09:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:55] done [09:33:27] (03CR) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation (034 comments) [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 (owner: 10Giuseppe Lavagetto) [09:33:34] akosiaris: on the whole fleet ? [09:33:45] !log restart jobchron service across jobrunners T129148 [09:33:49] hashar: yup [09:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:55] T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner) - https://phabricator.wikimedia.org/T129148 [09:34:00] hashar: 8 groups in total [09:34:14] no, 9 if we count the canary group [09:34:25] looks fine across all of them [09:34:40] apparently yes [09:35:08] (03PS3) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 [09:35:15] !log restart jobchron service across videoscalers T129148 [09:35:17] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
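The "Check systemd state" alert above reports "degraded" when one or more units on the host have failed, which is what `systemctl is-system-running` prints in that situation. A minimal Python sketch of such a check, assuming systemctl is available; not necessarily how the production NRPE plugin is implemented:

```python
#!/usr/bin/env python3
"""Minimal sketch of a 'Check systemd state' style check; not necessarily
how the production NRPE plugin is written. `systemctl is-system-running`
prints 'degraded' when one or more units have failed."""
import subprocess
import sys


def main():
    result = subprocess.run(['systemctl', 'is-system-running'],
                            capture_output=True, text=True)
    state = result.stdout.strip()
    if state == 'running':
        print('OK - running: The system is fully operational')
        return 0
    print('CRITICAL - %s: the system is operational but one or more '
          'units failed (or it is still starting/stopping)' % state)
    return 2


if __name__ == '__main__':
    sys.exit(main())
```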
[09:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:45] <_joe_> akosiaris: videoscalers are trustys btw [09:35:52] _joe_: yes I know [09:36:33] so... here's an interesting twist [09:36:54] jobrunners in codfw should not be running either jobchron.service or jobrunner.service [09:37:43] the jobrunner service on mw2153 exited 143 apparently [09:37:56] <_joe_> akosiaris: yes [09:38:12] <_joe_> akosiaris: both need to be stopped. [09:38:17] PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:38:36] yeah doing so now [09:39:17] PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:41:08] we still want to deploy the code on both eqiad and codfw hosts dont we ? [09:41:12] yes [09:41:28] should the services be masked on codfw host so? [09:41:39] !log stop jobchron/jobrunner processes across jobrunner and videoscalers in codfw [09:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:17] RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational [09:43:17] RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational [09:43:17] RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational [09:43:26] !log installing perl security updates [09:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:37] (03PS1) 1020after4: Use maniphest.edit in phab_epipe.py [puppet] - 10https://gerrit.wikimedia.org/r/357354 (https://phabricator.wikimedia.org/T159043) [09:43:54] hashar: so the services are stopped and disabled [09:43:58] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [09:43:58] but not masked ... [09:44:07] as it they can be start on a whim [09:44:24] and disabled state would prevent scap from restarting, wouldn't it? [09:44:32] no [09:44:57] disabled/enabled is orthogonal to start/stop/allowed_to_start/allowed_to_stop [09:45:09] it's only related to what happens after a boot [09:45:22] enabled => will be started on boot, disabled => nope [09:45:22] (03CR) 1020after4: "This is not particularly urgent." [puppet] - 10https://gerrit.wikimedia.org/r/357354 (https://phabricator.wikimedia.org/T159043) (owner: 1020after4) [09:47:11] hashar: and mask would not work on the videoscalers [09:47:20] they are trusty.. no systemd there, hence no mask [09:48:11] hashar: turns out we have at least 2 actionables [09:48:21] 1) scap should allow restarting multiple services [09:48:35] 2) figure out how to not restart jobrunner/jobchron in the non-active DC [09:49:14] 2) should probably not be done in scap but rather in puppet [09:49:55] _joe_: perhaps we should not be shipping jobchron and jobrunner systemd/upstart units on the non-active DC ? [09:50:11] :( [09:50:54] for multiple dc, I wonder how parsoid solves that [09:51:11] what do you mean ? [09:51:19] parsoid runs on both DCs [09:51:28] it has no state to manage [09:51:38] ahh [09:54:50] 10Operations, 10Commons, 10Wikimedia-Site-requests, 10media-storage, 10Patch-For-Review: Server side upload for Yann - https://phabricator.wikimedia.org/T166806#3317566 (10fgiunchedi) Thanks everyone for your help in debugging this, @Yann did uploading other files with e.g. v2c worked eventually? 
I see @... [09:55:34] akosiaris: I am filling a task for "scap should allow restarting multiple services" [09:55:41] hashar: ok [09:55:45] thanks! [09:58:56] hashar: ok I think we are done, aside from the actionables above. I am moving on to a different task, lemme know if you need something [10:03:27] akosiaris: want me to fill the 2) one about not restarting in non-active DC ? [10:09:44] hashar: yeah sure. thanks! [10:09:49] appreciated :-) [10:10:02] I am actually looking into the service provider [10:10:16] looks like "mask" is supported in some versions [10:10:48] but not 3.18 that we got :-( [10:10:53] 3.8* [10:12:04] akosiaris: https://phabricator.wikimedia.org/T167104 and you are on cc [10:12:05] 10Operations, 10Deployment-Systems, 10MediaWiki-JobRunner, 10scap2, and 2 others: figure out how to not restart jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3317757 (10hashar) [10:12:14] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:12:27] https://github.com/puppetlabs/puppet/commit/1e2a71604e184477f94d516d86366adf1fef2452 [10:12:33] on 4.2.0 and later [10:12:35] and we have videoscalers still on trusty [10:12:38] yes [10:12:48] so we need a different solution than mask [10:13:06] that works across distros and supports puppet 3.8 [10:14:07] and some hosts are still on puppet 3.7 ? [10:14:26] you're forgetting the actionable of upgrading those to trusty :) [10:14:36] moritzm: what's the status of that? I think you were working on it last [10:16:27] 10Operations, 10Deployment-Systems, 10MediaWiki-JobRunner, 10scap2, and 2 others: figure out how to not restart jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3317805 (10akosiaris) I 've had a quick look into the `mask` feature of systemd. That should allow us to mark a... [10:16:37] hashar: no puppet 3.7 hosts [10:17:03] paravoid: that's irrelevant. we can't really use nicely the mask feature anyway. See ^ [10:17:13] we are btw kind of abusing it in maps IIRC [10:17:29] yeah well, we should be doing that anyway [10:17:32] we allow the dev to mask the services in order to avoid having them restarted by puppet [10:17:48] but that will have to change as it seems ;) [10:17:57] it's blocked by the HHVM 3.18 memory corruption exposed by luasandbox, HHVM developers acknowledged that earlier the day, but not patch available yet [10:18:15] ah [10:18:38] we can't do it with hhvm 3.12? [10:20:11] 10Operations, 10DBA: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#1761444 (10Marostegui) I have installed `wmf-mariadb101_10.1.23-1_amd64.deb` on a fresh stretch to play around with it - will get back to you if I see issues! [10:20:50] I'd rather not, 3.12 is EOLed and has a couple of bugs fixed in 3.18, so better start with the current version from the start [10:21:14] 10Operations, 10Continuous-Integration-Config, 10Operations-Software-Development, 10Patch-For-Review: Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169#3317841 (10fgiunchedi) >>! In T144169#3317402, @Joe wrote: >> Re: naming, I think an obvious convention... [10:22:15] !log installing NSS security updates [10:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:06] but aren't they orthogonal transitions by nature? 
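As a side note to the enable/disable/mask discussion above: `systemctl is-enabled` reports the boot-time behaviour (including "masked", which also forbids manual starts), while `systemctl is-active` reports whether the unit is currently running, and the two are independent. A small Python sketch that queries both; the unit name is illustrative only:

```python
#!/usr/bin/env python3
"""Small sketch: systemd's enabled/disabled/masked state (boot behaviour,
and for 'masked' whether the unit may be started at all) is independent
from whether the unit is currently active. Unit name is illustrative."""
import subprocess


def unit_status(unit):
    # `is-enabled` -> enabled/disabled/masked/...; `is-active` -> active/inactive/failed.
    # Both commands exit non-zero for the "negative" answers, so no check=True here.
    enabled = subprocess.run(['systemctl', 'is-enabled', unit],
                             capture_output=True, text=True).stdout.strip()
    active = subprocess.run(['systemctl', 'is-active', unit],
                            capture_output=True, text=True).stdout.strip()
    return enabled, active


if __name__ == '__main__':
    # e.g. ('disabled', 'active'): disabled only means "not started at boot"
    print(unit_status('jobrunner.service'))
```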
[10:28:34] PROBLEM - OTRS SMTP on mendelevium is CRITICAL: connect to address 10.64.32.174 and port 25: Connection refused [10:28:43] (03PS1) 10Ema: check_ipmi_temp: turn off sel checking [puppet] - 10https://gerrit.wikimedia.org/r/357361 (https://phabricator.wikimedia.org/T125205) [10:28:44] PROBLEM - Check systemd state on mendelevium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:29:08] somewhat, but the 3.12-3.18 move doesn't add particular risk either [10:30:04] Amir1: Respected human, time to deploy Deploy new wb_terms configs to testwikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1030). Please do the needful. [10:30:28] akosiaris hashar can you let me know when your migration is finished? I have a scap update lined up but didn't want to do it at the same time [10:31:12] I can put pause on my deployment [10:31:33] Amir1: no worries I can wait, not urgent [10:31:36] ping me when done tho [10:32:10] kk [10:32:17] godog: we are done [10:33:08] okay, I start the deployment [10:33:56] (03CR) 10Marostegui: [C: 031] check_ipmi_temp: turn off sel checking [puppet] - 10https://gerrit.wikimedia.org/r/357361 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [10:34:11] (03PS1) 10Hashar: jobrunner: add exit codes to services units [puppet] - 10https://gerrit.wikimedia.org/r/357362 [10:34:33] (03CR) 10Ema: [V: 032 C: 032] check_ipmi_temp: turn off sel checking [puppet] - 10https://gerrit.wikimedia.org/r/357361 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [10:35:25] (03CR) 10Hashar: "The alternative is to clean up redisJobChronService / redisJobRunnerService and have them exit(0) when SIGHUP/SIGKILL/SIGTERM are caught." [puppet] - 10https://gerrit.wikimedia.org/r/357362 (owner: 10Hashar) [10:35:34] RECOVERY - OTRS SMTP on mendelevium is OK: SMTP OK - 0.150 sec. response time [10:35:44] RECOVERY - Check systemd state on mendelevium is OK: OK - running: The system is fully operational [10:36:45] marostegui: Around? https://phabricator.wikimedia.org/T165246 says it's resolved but in labs, the column is not there. [10:36:53] Amir1: checking [10:37:28] (03PS9) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [10:38:53] I can see it there on 1001 and 1003 [10:40:35] 10Operations, 10DBA: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3317984 (10jcrespo) I have to package 10.1.24 and fix some things- coming soon. 
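Change 357362 above adds exit codes to the jobrunner service units, and the review comment mentions the alternative of having redisJobRunnerService / redisJobChronService exit(0) when SIGHUP/SIGTERM are caught (exit status 143 is 128 + SIGTERM, which systemd otherwise records as a failure). A minimal Python sketch of that signal-handling pattern, purely illustrative since the real services are not Python, and noting that SIGKILL can never be trapped:

```python
#!/usr/bin/env python3
"""Purely illustrative sketch (the real jobrunner services are not Python):
trap SIGTERM/SIGHUP and exit(0) so that systemd records a clean stop
instead of exit status 143 (128 + SIGTERM). SIGKILL cannot be trapped."""
import signal
import sys
import time


def clean_exit(signum, frame):
    # Exiting 0 here makes the stop look like a normal shutdown to systemd.
    sys.exit(0)


signal.signal(signal.SIGTERM, clean_exit)
signal.signal(signal.SIGHUP, clean_exit)

while True:
    time.sleep(1)   # stand-in for the service's main loop
```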
[10:42:43] (03PS1) 10Alexandros Kosiaris: Update Templates for 5.0.20 OTRS version [software/otrs] - 10https://gerrit.wikimedia.org/r/357363 [10:43:46] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update Templates for 5.0.20 OTRS version [software/otrs] - 10https://gerrit.wikimedia.org/r/357363 (owner: 10Alexandros Kosiaris) [10:48:43] (03CR) 10Ladsgroup: [C: 032] Write in term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [10:54:14] RECOVERY - IPMI Temperature on aqs1004 is OK: Sensor Type(s) Temperature Status: OK [10:54:14] RECOVERY - IPMI Temperature on ms-be2028 is OK: Sensor Type(s) Temperature Status: OK [10:54:46] (03PS2) 10Ladsgroup: Write in term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197) [10:56:56] (03Draft2) 10Alexandros Kosiaris: Edit Project Config [software/servermon] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/357351 [10:57:06] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Edit Project Config [software/servermon] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/357351 (owner: 10Alexandros Kosiaris) [10:57:18] (03CR) 10Ladsgroup: [C: 032] Write in term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [10:57:54] RECOVERY - IPMI Temperature on labsdb1003 is OK: Sensor Type(s) Temperature Status: OK [10:58:44] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. [10:59:56] (03Merged) 10jenkins-bot: Write in term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [11:02:23] !log ladsgroup@tin Synchronized wmf-config/Wikibase-production.php: Enabling writing in full entity id in testwikidatawiki (T165197) (duration: 00m 39s) [11:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:32] T165197: Change configuration of Wikidata to write term_full_entity_id - https://phabricator.wikimedia.org/T165197 [11:02:44] (03CR) 10jenkins-bot: Write in term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [11:08:34] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [11:09:34] RECOVERY - IPMI Temperature on labsdb1011 is OK: Sensor Type(s) Temperature Status: OK [11:10:24] RECOVERY - IPMI Temperature on db2049 is OK: Sensor Type(s) Temperature Status: OK [11:11:25] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::scaler: use more sensible intervals for checks [puppet] - 10https://gerrit.wikimedia.org/r/357366 [11:11:42] <_joe_> volans: ^^ care to give a input? [11:12:29] (03PS1) 10Alexandros Kosiaris: servermon: Deploy with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/357367 (https://phabricator.wikimedia.org/T129152) [11:13:48] _joe_: sure [11:14:16] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3318112 (10Marostegui) >>! 
In T166853#3311393, @jcrespo wrote: > This one is also showing the following alarm- > > > ``` > Sensor Type(s) Temperature Status: Critical [Power Unit 2 18-VR P2 = Critical, Po... [11:15:08] 10Operations, 10Ops-Access-Requests: Request access to analytics-privatedata-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T167116#3318114 (10GoranSMilovanovic) [11:15:23] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/357366 (owner: 10Giuseppe Lavagetto) [11:15:57] (03PS1) 10Ladsgroup: Whitelist term_full_entity_id in wb_terms table [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) [11:16:24] RECOVERY - IPMI Temperature on wtp1010 is OK: Sensor Type(s) Temperature Status: OK [11:16:27] (03CR) 10jerkins-bot: [V: 04-1] Whitelist term_full_entity_id in wb_terms table [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup) [11:16:29] (03CR) 10Giuseppe Lavagetto: [C: 032] role::mediawiki::scaler: use more sensible intervals for checks [puppet] - 10https://gerrit.wikimedia.org/r/357366 (owner: 10Giuseppe Lavagetto) [11:16:58] !log uploaded ferm 2.3.2+wmf1 to apt.wikimedia.org/stretch-wikimedia (T166653) [11:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:09] T166653: ferm broken in stretch - https://phabricator.wikimedia.org/T166653 [11:19:04] PROBLEM - salt-minion processes on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:04] PROBLEM - dhclient process on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:14] PROBLEM - DPKG on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:14] PROBLEM - Check systemd state on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:24] PROBLEM - Disk space on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:26] godog: I'm done [11:19:34] PROBLEM - configured eth on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:44] PROBLEM - puppet last run on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:04] Amir1: thanks for the heads up! 
[11:20:32] Sorry it took so long, testing it was difficult [11:21:39] (03CR) 10Ladsgroup: "recheck" [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup) [11:21:54] RECOVERY - salt-minion processes on d-i-test is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [11:21:54] RECOVERY - dhclient process on d-i-test is OK: PROCS OK: 0 processes with command name dhclient [11:22:00] (03CR) 10jerkins-bot: [V: 04-1] Whitelist term_full_entity_id in wb_terms table [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup) [11:22:04] RECOVERY - DPKG on d-i-test is OK: All packages OK [11:22:04] RECOVERY - Check systemd state on d-i-test is OK: OK - running: The system is fully operational [11:22:15] RECOVERY - Disk space on d-i-test is OK: DISK OK [11:22:20] (03PS2) 10Alexandros Kosiaris: servermon: Deploy with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/357367 (https://phabricator.wikimedia.org/T129152) [11:22:24] RECOVERY - configured eth on d-i-test is OK: OK - interfaces up [11:22:34] RECOVERY - puppet last run on d-i-test is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:22:41] d-i-test in puppet ? [11:22:47] (03CR) 10Ladsgroup: "The error doesn't look related" [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup) [11:23:07] (03CR) 10Faidon Liambotis: [C: 04-2] "Well, that init script is pretty terrible. We need to code review this and "it was part of a package before" isn't a terribly good argumen" [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [11:25:54] (03PS3) 10Alexandros Kosiaris: servermon: Deploy with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/357367 (https://phabricator.wikimedia.org/T129152) [11:26:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "PCC happy at https://puppet-compiler.wmflabs.org/6671/netmon1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/357367 (https://phabricator.wikimedia.org/T129152) (owner: 10Alexandros Kosiaris) [11:26:34] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [11:28:06] !log akosiaris@tin Started deploy [servermon/servermon@4a2288f]: (no justification provided) [11:28:10] !log akosiaris@tin Finished deploy [servermon/servermon@4a2288f]: (no justification provided) (duration: 00m 04s) [11:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:34] RECOVERY - IPMI Temperature on cp1049 is OK: Sensor Type(s) Temperature Status: OK [11:29:58] 10Operations, 10Deployment-Systems, 10Monitoring, 10scap2, and 2 others: Deploy servermon with scap3 - https://phabricator.wikimedia.org/T129152#3318171 (10akosiaris) 05Open>03Resolved a:03akosiaris Migration completed. Servermon is now deployed using scap3. Resolving. 
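Related to the earlier !log about raising the TransportShardBulkAction log level on the cirrus cluster (T167091) and Gehel's puppet change 357371: Elasticsearch 5.x also allows changing a logger level at runtime through the cluster settings API. A hedged Python sketch against a hypothetical local node; host, port and the exact logger name are assumptions, and the production change is managed through puppet instead:

```python
#!/usr/bin/env python3
"""Sketch only: change an Elasticsearch 5.x logger level at runtime via the
cluster settings API. Host, port and logger name are assumptions; the
production change for T167091 is managed through puppet instead."""
import json

import requests


def set_logger_level(base_url, logger, level):
    body = {'transient': {'logger.' + logger: level}}
    resp = requests.put(base_url + '/_cluster/settings',
                        data=json.dumps(body),
                        headers={'Content-Type': 'application/json'})
    resp.raise_for_status()
    return resp.json()


if __name__ == '__main__':
    # assumed logger package for TransportShardBulkAction
    print(set_logger_level('http://localhost:9200',
                           'org.elasticsearch.action.bulk', 'WARN'))
```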
[11:34:45] ACKNOWLEDGEMENT - MegaRAID on ms-be2001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167118 [11:34:48] 10Operations, 10ops-codfw: Degraded RAID on ms-be2001 - https://phabricator.wikimedia.org/T167118#3318179 (10ops-monitoring-bot) [11:36:16] (03PS2) 10Hashar: Test that replica counts are within sane bounds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357345 (owner: 10DCausse) [11:36:44] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [11:39:30] (03CR) 10Hashar: [C: 032] "Cherry picked on tip of master :-} Thanks for the cleanup." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357345 (owner: 10DCausse) [11:40:33] (03Merged) 10jenkins-bot: Test that replica counts are within sane bounds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357345 (owner: 10DCausse) [11:40:42] (03CR) 10jenkins-bot: Test that replica counts are within sane bounds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357345 (owner: 10DCausse) [11:46:26] 10Operations, 10Monitoring, 10Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3318241 (10ema) [11:46:29] 10Operations, 10Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3318229 (10ema) [11:46:44] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [11:46:44] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [11:47:25] (03PS4) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 [11:47:57] (03PS1) 10Gehel: elasticsearch - raise logging of TransportShardBulkAction to WARN [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) [11:48:12] 10Operations, 10ops-codfw, 10Traffic: Degraded RAID on ms-be2001 - https://phabricator.wikimedia.org/T167118#3318249 (10Volans) [11:48:18] <_joe_> ema: ^^, but I still want to add a reactor.stop() or something to the deferreds treating ipvs [11:48:42] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2001 - https://phabricator.wikimedia.org/T167118#3318179 (10Volans) [11:48:57] (03CR) 10jerkins-bot: [V: 04-1] Add netlink-based Ipvsmanager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 (owner: 10Giuseppe Lavagetto) [11:50:42] _joe_: protocol pony :) [11:50:58] <_joe_> ema: :P [11:51:06] <_joe_> ema: uhm I did break some tests [12:02:03] 10Operations, 10ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3318296 (10jcrespo) [12:08:15] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2039774 [12:08:25] !log kill stuck osm replication on maps1001 [12:08:25] 10Operations, 10Upstream: ferm broken in stretch - https://phabricator.wikimedia.org/T166653#3318320 (10MoritzMuehlenhoff) I've uploaded ferm 2.3-2+wmf1 to stretch-wikimedia which unbreaks ferm by waiting on nss-lookup.target. This makes ferm start 1-1.5 seconds later than the default stretch unit using netwo... 
[12:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:40] tmux [12:08:51] oops, wrong windows... [12:10:55] (03CR) 10Hashar: [C: 04-1] "Your change remove Java 7 from the Jessie slaves. However we still have Maven jobs using Java 7:" [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [12:16:03] !log cp1049 - restaret varnish backend for mailbox lag [12:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:15] (03CR) 10Mforns: "This patch is to be abandoned at some point right?" [software/analytics-eventlogging-maintenance] - 10https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [12:17:34] (03CR) 10Hashar: [C: 031] "This change is still cherry picked on the CI puppet master. That is to unbreak Puppet on the permanent instances that still have HHVM." [puppet] - 10https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: 10Paladox) [12:18:15] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0 [12:18:26] 10Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3318345 (10Marostegui) 05Open>03Resolved Going to close this for now as we had no more crashes lately. [12:24:55] (03PS7) 10Paladox: contint: Only install java 7 on trusty and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) [12:26:00] (03Abandoned) 10Elukey: [WIP] Add the eventlogging_cleaner script and base package [software/analytics-eventlogging-maintenance] - 10https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [12:26:07] (03CR) 10Paladox: "Thanks, done. @Dzahn we have to install java 7 and java 8 on Jessie so have to do the if checks like this. Some of the android tests were " [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [12:28:41] (03PS1) 10Elukey: Disable role::analytics_cluster::refinery::job::guard [puppet] - 10https://gerrit.wikimedia.org/r/357372 (https://phabricator.wikimedia.org/T166937) [12:29:12] paravoid: --^ [12:29:41] 10Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3318369 (10MoritzMuehlenhoff) >>! In T158583#3310361, @faidon wrote: >> >> I think a mere component/component-staging mapping would be better; it provides more consistency and would also allow generic handlin... [12:30:34] (03CR) 10Elukey: [C: 032] Disable role::analytics_cluster::refinery::job::guard [puppet] - 10https://gerrit.wikimedia.org/r/357372 (https://phabricator.wikimedia.org/T166937) (owner: 10Elukey) [12:33:24] (03CR) 10DCausse: elasticsearch - raise logging of TransportShardBulkAction to WARN (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) (owner: 10Gehel) [12:35:00] (03CR) 10Mforns: [C: 031] "LGTM!" 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [12:35:56] (03PS1) 10Hashar: contint: remove HHVM from Trusty permanent instances [puppet] - 10https://gerrit.wikimedia.org/r/357373 [12:37:20] (03PS2) 10Gehel: elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) [12:38:32] (03CR) 10Hashar: [C: 031] "Cherry picked on CI puppet master. I have manually purged HHVM." [puppet] - 10https://gerrit.wikimedia.org/r/357373 (owner: 10Hashar) [12:39:05] elukey: <3 [12:39:13] 10Operations, 10Traffic, 10Interactive-Sprint, 10Maps (Kartographer), 10Regression: Map tiles load way slower than before - https://phabricator.wikimedia.org/T167046#3318372 (10Gehel) I can't seem to reproduce the problem from my browser. Looking at the [[ https://grafana.wikimedia.org/dashboard/db/maps-... [12:40:05] !log mobrovac@tin Started deploy [changeprop/deploy@e92dd66]: Bump src to bc8abf3 [12:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:50] !log mobrovac@tin Finished deploy [changeprop/deploy@e92dd66]: Bump src to bc8abf3 (duration: 01m 45s) [12:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:16] 10Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3318374 (10faidon) >>! In T158583#3318369, @MoritzMuehlenhoff wrote: >>>! In T158583#3310361, @faidon wrote: >>> >>> I think a mere component/component-staging mapping would be better; it provides more consis... [12:44:00] (03PS2) 10Filippo Giunchedi: Scap: Bump version to 3.5.8-1 [puppet] - 10https://gerrit.wikimedia.org/r/357239 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [12:44:27] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3318379 (10Cmjohnson) @fguinchedi he batteries for ms-be1020 and 1019 are on-site...please let me know when you want to swap them [12:45:12] 10Operations, 10ops-eqiad, 10DBA: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3318380 (10Cmjohnson) @Marostegui The battery is here...let me know when you want to replace [12:45:25] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [12:47:15] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [12:47:21] cmjohnson1: awesome, today in 20" works for you? re: hp battery [12:47:39] godog..sure [12:48:25] cmjohnson1: nice, I'll ping you in 20" ! [12:48:31] (03CR) 10Filippo Giunchedi: [C: 032] Scap: Bump version to 3.5.8-1 [puppet] - 10https://gerrit.wikimedia.org/r/357239 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [12:48:55] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [12:49:45] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [12:49:59] 10Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3318384 (10MoritzMuehlenhoff) Ok, got it. I think there are valid use cases for both, for a temporary migration (e.g. towards a new HHVM LTS) it seems more useful to use -staging, while for more generational c... 
[12:50:05] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 39 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:51:25] !log upgrade scap to 3.5.8 - T127762 [12:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:33] T127762: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762 [12:52:19] hashar: nothing for eu swat so far [12:53:14] 10Operations, 10ops-eqiad, 10DBA: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3318391 (10Marostegui) @Cmjohnson I will depool the server now and ping you once it is down. [12:54:09] (03PS1) 10Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357375 (https://phabricator.wikimedia.org/T166518) [12:54:15] jouncebot: refresh [12:54:17] I refreshed my knowledge about deployments. [12:54:21] jouncebot: next [12:54:21] In 0 hour(s) and 5 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1300) [12:54:30] zeljkof: nice :-} [12:55:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357375 (https://phabricator.wikimedia.org/T166518) (owner: 10Marostegui) [12:56:51] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357375 (https://phabricator.wikimedia.org/T166518) (owner: 10Marostegui) [12:57:00] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357375 (https://phabricator.wikimedia.org/T166518) (owner: 10Marostegui) [12:58:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1094 for maintenance - T166518 (duration: 00m 39s) [12:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:13] T166518: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518 [12:58:23] !log Shutdown db1094 for maintenance - T166518 [12:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1300). Please do the needful. [13:01:58] (03CR) 10Alexandros Kosiaris: Refactor facts exporting to better cleanup facts (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356814 (owner: 10Alexandros Kosiaris) [13:03:18] (03PS4) 10Alexandros Kosiaris: Refactor facts exporting to better cleanup facts [puppet] - 10https://gerrit.wikimedia.org/r/356814 [13:05:17] (03CR) 10DCausse: [C: 031] elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) (owner: 10Gehel) [13:07:47] (03CR) 10Filippo Giunchedi: "LGTM, for now it would work as-is. 
Though since runtime can be significant I think an improvement would be to accept an optional list of f" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: 10Volans) [13:08:32] (03PS1) 10Filippo Giunchedi: install_server: ms-be2013 / 16 / 17 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/357377 (https://phabricator.wikimedia.org/T162609) [13:08:34] (03PS1) 10Filippo Giunchedi: hieradata: fix missing yaml extension [puppet] - 10https://gerrit.wikimedia.org/r/357378 [13:11:18] cmjohnson1: ok to start from ms-be1019 ? I'll power down [13:13:53] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2001 - https://phabricator.wikimedia.org/T167118#3318430 (10fgiunchedi) @Papaul this host is scheduled for decom and has otherwise no production data, don't bother replacing the disk [13:14:59] godog hold on...I only received 1 bbu for you not the 2 [13:15:17] let me check to make sure we're replacing the correct server in case HP needs a log [13:15:30] cmjohnson1: ok [13:16:35] RECOVERY - HP RAID on db1094 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: OK [13:16:41] godog: let's do ms-be1020 please [13:17:03] cmjohnson1: sure, I'll downtime and power off [13:19:12] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/357378 (owner: 10Filippo Giunchedi) [13:21:22] (03PS3) 10Gehel: elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) [13:21:25] cmjohnson1: should be off now / shortly [13:21:53] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3318469 (10Marostegui) 05Open>03Resolved a:03Cmjohnson All good now - thanks Chris! ``` Cache Backup Power Source: Batteries Battery/Capacitor... [13:24:30] godog: powering on [13:26:44] (03PS1) 10Marostegui: db-eqiad.php: Repool db1094 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357379 [13:26:52] 10Operations, 10Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3318478 (10ema) [13:28:27] (03PS2) 10Filippo Giunchedi: install_server: ms-be2013 / 16 / 17 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/357377 (https://phabricator.wikimedia.org/T162609) [13:29:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1094 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357379 (owner: 10Marostegui) [13:29:58] cmjohnson1: yep thanks it 1020 is back, I'll update the task [13:30:15] 10Operations, 10Deployment-Systems, 10MediaWiki-JobRunner, 10Release-Engineering-Team (Kanban), 10Scap (Scap3-Adoption-Phase1): figure out how to not restart jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3318487 (10thcipriani) [13:31:25] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3318492 (10fgiunchedi) ms-be1020 had its bbu swapped, error cleared: ``` # /usr/local/lib/nagios/plugins/check_hpssacli OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I... 
[13:32:16] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1094 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357379 (owner: 10Marostegui) [13:32:24] (03CR) 10Filippo Giunchedi: [C: 032] install_server: ms-be2013 / 16 / 17 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/357377 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [13:32:28] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1094 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357379 (owner: 10Marostegui) [13:33:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1094 with low weight (duration: 00m 40s) [13:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:47] 10Operations, 10Monitoring, 10Services (next), 10User-Joe, 10User-mobrovac: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3318499 (10mobrovac) [13:34:06] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318500 (10Cmjohnson) @elukey new raid controllers for an1033 and 1039 are on-site. please let me know when you want to swap them out [13:34:43] cmjohnson1: whenever you want! [13:34:52] I'd just need half an hour to drain the hosts [13:34:56] let's do this now so I can get back to the new servers [13:35:04] okay...ping me once they're ready [13:35:32] cmjohnson1: sure, draining them now [13:36:12] 10Operations, 10ops-eqiad, 10User-Joe: Decom mw1170-mw1179, and replace them with new systems. - https://phabricator.wikimedia.org/T167130#3318501 (10Cmjohnson) [13:37:09] 10Operations, 10ops-eqiad, 10User-Joe: Decom mw1170-mw1179, and replace them with new systems. - https://phabricator.wikimedia.org/T167130#3318513 (10Cmjohnson) [13:37:11] 10Operations, 10ops-eqiad, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3318514 (10Cmjohnson) [13:37:15] RECOVERY - HP RAID on ms-be1020 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [13:38:23] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3301040 (10MoritzMuehlenhoff) I've added Keith to pwstore and he confirmed that it's working fine. [13:38:35] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3318523 (10MoritzMuehlenhoff) [13:38:56] (03PS1) 10Marostegui: db-eqiad.php: Increase db1094 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357381 [13:39:32] !log shutdown analytics1033 and analytics1039 to replace their BBU - T166140 [13:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:43] T166140: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140 [13:42:31] 10Operations, 10OTRS: Upgrade OTRS to 5.0.20 - https://phabricator.wikimedia.org/T167131#3318529 (10akosiaris) [13:42:41] 10Operations, 10OTRS: Upgrade OTRS to 5.0.20 - https://phabricator.wikimedia.org/T167131#3318544 (10akosiaris) 05Open>03Resolved [13:43:43] 10Operations, 10OTRS, 10Patch-For-Review: Upgrade OTRS to 5.0.19 - https://phabricator.wikimedia.org/T165284#3261913 (10akosiaris) Almost a week has passed, I 'll resolve this one. Feel free to reopen. 
Note that per T167131 we have already upgraded to 5.0.20 [13:43:50] 10Operations, 10OTRS, 10Patch-For-Review: Upgrade OTRS to 5.0.19 - https://phabricator.wikimedia.org/T165284#3318550 (10akosiaris) 05Open>03Resolved [13:44:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1094 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357381 (owner: 10Marostegui) [13:45:00] 10Operations, 10OTRS, 10Upstream: Investigate OTRS 5.0.6 memory leak - https://phabricator.wikimedia.org/T126448#3318552 (10akosiaris) 05Open>03declined I am gonna resolve this as Declined. Upstream did not verify this bug's existence and we have mitigations in place anyway. [13:45:23] (03PS4) 10Tjones: Enable BM25 for Chinese wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) [13:45:26] cmjohnson1: the hosts should begin to shutdown in a minute [13:45:33] (analytics1033 and 1039) [13:45:48] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1094 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357381 (owner: 10Marostegui) [13:45:57] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1094 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357381 (owner: 10Marostegui) [13:46:03] 10Operations, 10DBA, 10Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3318555 (10jcrespo) 05Open>03stalled [13:46:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1094 weight (duration: 00m 40s) [13:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:26] (03PS2) 10Andrew Bogott: novastats: Add 'diskspace.py' script [puppet] - 10https://gerrit.wikimedia.org/r/357014 (https://phabricator.wikimedia.org/T163796) [13:50:37] (03CR) 10Andrew Bogott: [C: 032] novastats: Add 'diskspace.py' script [puppet] - 10https://gerrit.wikimedia.org/r/357014 (https://phabricator.wikimedia.org/T163796) (owner: 10Andrew Bogott) [13:51:25] (03PS2) 10Filippo Giunchedi: hieradata: fix missing yaml extension [puppet] - 10https://gerrit.wikimedia.org/r/357378 [13:52:53] (03PS1) 10Marostegui: db-eqiad.php: Restore db1094 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357383 [13:52:55] (03CR) 10Muehlenhoff: [C: 031] hieradata: fix missing yaml extension [puppet] - 10https://gerrit.wikimedia.org/r/357378 (owner: 10Filippo Giunchedi) [13:53:11] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: fix missing yaml extension [puppet] - 10https://gerrit.wikimedia.org/r/357378 (owner: 10Filippo Giunchedi) [13:54:49] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3318627 (10faidon) >>! In T166888#3316057, @greg wrote: > Looking at the data we have it seems that the tests themselves take about [[ https://integration.wikimedia.org... [13:57:10] ugh looks like eqiad / smokeping can't talk at all to cr1-eqdfw ? 
https://smokeping.wikimedia.org/smokeping.cgi?target=codfw.Core.cr1-eqdfw [13:57:30] XioNoX ^ [13:59:30] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1094 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357383 (owner: 10Marostegui) [14:00:33] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1094 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357383 (owner: 10Marostegui) [14:00:42] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1094 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357383 (owner: 10Marostegui) [14:01:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1094 original weight (duration: 00m 40s) [14:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:10] godog, I can reach that router, investigating, is that causing any outage? [14:02:39] elukey: 1033 is powering up [14:03:03] XioNoX: no impact afaict no, I was surprised though that smokeping in eqiad stopped being able to talk to it [14:03:32] (03PS2) 10Mobrovac: Set the User-Agent header field when doing requests; v0.1.2 [software/service-checker] - 10https://gerrit.wikimedia.org/r/356870 [14:05:12] 10Operations, 10Monitoring, 10Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3318684 (10ema) [14:05:15] 10Operations, 10Monitoring: labvirt1008/labsdb1001: FreeIPMI returned an empty header map - https://phabricator.wikimedia.org/T167138#3318672 (10ema) [14:07:07] cmjohnson1: ack [14:15:30] an1033 looks good [14:16:53] (03PS1) 10Andrew Bogott: diskspace.py: Catch stray instances that nova and filesystem disagree about [puppet] - 10https://gerrit.wikimedia.org/r/357388 [14:19:22] (03CR) 10jerkins-bot: [V: 04-1] diskspace.py: Catch stray instances that nova and filesystem disagree about [puppet] - 10https://gerrit.wikimedia.org/r/357388 (owner: 10Andrew Bogott) [14:20:12] (03PS2) 10Andrew Bogott: diskspace.py: Catch stray instances that nova and filesystem disagree about [puppet] - 10https://gerrit.wikimedia.org/r/357388 [14:20:32] elukey: 1039 is powering on [14:21:18] godog: yeah, it's weird, mtr works, but not pings [14:23:59] (03CR) 10Giuseppe Lavagetto: [C: 032] Set the User-Agent header field when doing requests; v0.1.2 [software/service-checker] - 10https://gerrit.wikimedia.org/r/356870 (owner: 10Mobrovac) [14:24:04] XioNoX: I'm seeing at least a couple of Icinga alarms flapping regarding 208.80.153.198 [14:25:39] cmjohnson1: thanks! [14:25:40] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2046271 [14:26:10] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [14:26:14] 10Operations, 10Monitoring, 10Services (next), 10User-Joe, 10User-mobrovac: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3315673 (10Joe) I would start monitoring restbase on text-lb and maps on text-upload. [14:27:00] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [14:27:19] XioNoX: are we losing eqdfw? 
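A rough sketch of the comparison being described here ("mtr works, but not pings"), using the v4 address the Icinga checks above report for cr1-eqdfw; nothing in it is specific to the actual debugging session:
```
TARGET=208.80.153.198              # cr1-eqdfw, per the SNMP/BGP alerts above
ping -c 5 "$TARGET"                # plain ICMP echo requests
mtr -n -c 10 --report "$TARGET"    # per-hop probes; loss that shows up only at the final hop
                                   # usually means the router filters or rate-limits ICMP,
                                   # not that the path itself is down
# Running the same two commands from a codfw host (e.g. via ssh or cumin) shows whether
# the problem is specific to the eqiad side, as suggested above.
```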
[14:27:55] paravoid: some conenctivity issues from at least eqiad, still investigating [14:28:00] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [14:28:05] there is no eqiad->eqdfw [14:28:07] just codfw->eqdfw [14:28:26] XioNoX: the active icinga is tegmen now and it is in codfw [14:29:44] (03PS10) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [14:29:49] (03PS7) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [14:29:50] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [14:32:44] (03Abandoned) 10Hashar: contint: ElasticSearch role for build logs [puppet] - 10https://gerrit.wikimedia.org/r/322488 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar) [14:32:54] paravoid: like icmp can't go through but mtr works fine [14:33:14] except directly from outside [14:33:56] (03PS3) 10Andrew Bogott: diskspace.py: Catch stray instances that nova and filesystem disagree about [puppet] - 10https://gerrit.wikimedia.org/r/357388 [14:36:34] (03CR) 10Andrew Bogott: [C: 032] diskspace.py: Catch stray instances that nova and filesystem disagree about [puppet] - 10https://gerrit.wikimedia.org/r/357388 (owner: 10Andrew Bogott) [14:42:48] cmjohnson1: 1039 is good, thanks a lot! [14:42:58] great...the other okay? [14:43:04] yep yep all good [14:43:10] the BBU shows up as optimal now [14:44:00] (03Abandoned) 10Hashar: contint: update unattended-upgrade setting [puppet] - 10https://gerrit.wikimedia.org/r/315079 (owner: 10Hashar) [14:44:02] (03Abandoned) 10Hashar: contint: unattended upgrade from distro [puppet] - 10https://gerrit.wikimedia.org/r/315084 (owner: 10Hashar) [14:45:24] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318884 (10Cmjohnson) 05Open>03Resolved Replaced both bbu's Return shipping info Fedex 9612018 6911799 02034386 96112018 6911799 02034379 [14:45:47] (03PS3) 10Hashar: zuul: rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/299151 [14:47:49] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318897 (10elukey) 05Resolved>03Open [14:47:57] (03CR) 10Hashar: [V: 031 C: 031] "I have used that patch when refactoring the zuul class to use hiera. I typically use this to assert the zuul manifests somehow compile." 
[puppet] - 10https://gerrit.wikimedia.org/r/299151 (owner: 10Hashar) [14:49:00] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [14:49:12] (03PS3) 10Hashar: nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 [14:49:50] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [14:49:55] (03CR) 10Hashar: [V: 031 C: 031] "That covers an issue we had earlier when generating the icinga contacts (see fix https://gerrit.wikimedia.org/r/#/c/331459/ )" [puppet] - 10https://gerrit.wikimedia.org/r/331490 (owner: 10Hashar) [14:50:13] (03CR) 10Hashar: [V: 031 C: 031] nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 (owner: 10Hashar) [14:50:27] (03PS1) 10Filippo Giunchedi: swift: create swift user home [puppet] - 10https://gerrit.wikimedia.org/r/357396 (https://phabricator.wikimedia.org/T162609) [14:53:00] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [14:53:50] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [14:56:09] (03PS2) 10Filippo Giunchedi: swift: create swift user home [puppet] - 10https://gerrit.wikimedia.org/r/357396 (https://phabricator.wikimedia.org/T162609) [14:56:43] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318969 (10elukey) [14:57:54] (03CR) 10Filippo Giunchedi: [C: 031] Refactor facts exporting to better cleanup facts [puppet] - 10https://gerrit.wikimedia.org/r/356814 (owner: 10Alexandros Kosiaris) [14:58:25] !log installing libsndfile security updates on trusty [14:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:36] (03PS8) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [15:00:05] (03PS3) 10Filippo Giunchedi: swift: create swift user home [puppet] - 10https://gerrit.wikimedia.org/r/357396 (https://phabricator.wikimedia.org/T162609) [15:02:50] (03CR) 10Filippo Giunchedi: [C: 032] swift: create swift user home [puppet] - 10https://gerrit.wikimedia.org/r/357396 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [15:02:50] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [15:03:40] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [15:06:10] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [15:07:00] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [15:07:37] 10Operations, 10ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3319033 (10Papaul) p:05Triage>03Normal [15:07:50] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [15:09:40] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [15:12:00] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [15:13:50] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [15:15:44] 
(03CR) 10Paladox: [C: 031] "Looks correct in how it should be translated to json. Though untested." [puppet] - 10https://gerrit.wikimedia.org/r/357354 (https://phabricator.wikimedia.org/T159043) (owner: 1020after4) [15:16:21] (03CR) 10Madhuvishy: [C: 032] labstore: avoid the hardcoding of eth0/eth1 [puppet] - 10https://gerrit.wikimedia.org/r/356109 (owner: 10Faidon Liambotis) [15:16:56] (03CR) 10Madhuvishy: [C: 032] labstore: use the interface_primary fact, not eth0 [puppet] - 10https://gerrit.wikimedia.org/r/356108 (owner: 10Faidon Liambotis) [15:17:24] madhuvishy: I'd do https://gerrit.wikimedia.org/r/#/c/356107/ first, then PCC the rest [15:17:40] but ymmv :) [15:18:15] paravoid: ah yes that's why it wouldn't let me rebase, yup okay [15:18:21] (03CR) 10Faidon Liambotis: [C: 04-1] "Find a way to test this?" [puppet] - 10https://gerrit.wikimedia.org/r/357354 (https://phabricator.wikimedia.org/T159043) (owner: 1020after4) [15:21:51] !log otto@tin Started deploy [eventlogging/analytics@37233cd]: (no justification provided) [15:21:56] !log otto@tin Finished deploy [eventlogging/analytics@37233cd]: (no justification provided) (duration: 00m 04s) [15:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:10] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:10] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:10] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:11] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:11] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:11] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:11] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:12] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:12] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:14] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:14] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:24:01] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [15:24:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are 
healthy [15:24:10] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [15:24:10] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:26:50] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [15:27:59] (03CR) 10Madhuvishy: "I get the reasoning for `which tc`, but I think we should err on the side of fully qualifying the path with TC=/sbin/tc. This would be san" [puppet] - 10https://gerrit.wikimedia.org/r/356107 (owner: 10Faidon Liambotis) [15:28:40] (03PS4) 10Ottomata: Add exception for events tagged as coming from MW [puppet] - 10https://gerrit.wikimedia.org/r/356626 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [15:28:50] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [15:29:10] madhuvishy: it will be just noise [15:29:44] all of these fully-qualified paths are just misconceptions and people carrying over old unix practices to modern systems [15:29:47] (03CR) 10jerkins-bot: [V: 04-1] Add exception for events tagged as coming from MW [puppet] - 10https://gerrit.wikimedia.org/r/356626 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [15:30:02] (03PS1) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [15:30:18] (03PS4) 10Gehel: elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) [15:32:01] paravoid: sorry, in team meeting, will respond in a bit [15:32:08] k, sorry [15:32:19] (03PS5) 10Ottomata: Add exception for events tagged as coming from MW [puppet] - 10https://gerrit.wikimedia.org/r/356626 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [15:32:41] I really don't care that much though :) [15:33:55] (03PS2) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [15:34:18] 10Operations, 10ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3319132 (10Papaul) a:05Papaul>03jcrespo firmware upgrade complete [15:34:20] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:35:26] (03PS3) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [15:38:29] I hate how this is a hiera variable :( [15:39:29] XioNoX: what's the status of eqdfw then? 
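The `which tc` exchange above boils down to two styles; a tiny illustration of both (the TC variable name comes from the review comment, and the qdisc call is only an example invocation):
```
# Option A (roughly what the change under review does): resolve tc from the caller's PATH
TC="$(which tc)" || { echo 'tc not found in PATH' >&2; exit 1; }
# Option B (the reviewer's suggestion): pin the path so a user's PATH cannot change behaviour
# TC=/sbin/tc
"$TC" qdisc show   # example invocation only
```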
[15:40:51] paravoid: still investigating, IPv6 goes through fine, but v4 doesn't in some cases [15:42:05] Ipv6 is a b!** [15:42:11] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [15:42:19] (03CR) 10Ottomata: [C: 032] Add exception for events tagged as coming from MW [puppet] - 10https://gerrit.wikimedia.org/r/356626 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [15:43:01] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [15:43:20] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational [15:44:00] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [15:44:50] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [15:45:40] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 2010 [15:51:44] (03PS5) 10Gehel: elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) [15:53:51] (03CR) 10Gehel: [C: 032] elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) (owner: 10Gehel) [15:58:34] (03PS1) 10Giuseppe Lavagetto: role::graphite::alerts: add transformNull to some alerts [puppet] - 10https://gerrit.wikimedia.org/r/357409 [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1600). [16:00:52] (03PS2) 10Giuseppe Lavagetto: role::graphite::alerts: add transformNull to some alerts [puppet] - 10https://gerrit.wikimedia.org/r/357409 [16:02:00] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [16:02:04] <_joe_> oh jenkins you'll get me old [16:02:11] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::graphite::alerts: add transformNull to some alerts [puppet] - 10https://gerrit.wikimedia.org/r/357409 (owner: 10Giuseppe Lavagetto) [16:02:32] _joe_: got to love ci [16:02:53] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [16:06:13] (03PS4) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [16:06:50] PROBLEM - puppet last run on graphite1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:07:55] (03PS5) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [16:11:00] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [16:11:50] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [16:13:24] mmmm noop for pcc, weird [16:15:16] argh I got the wrong role [16:15:52] (03CR) 10Hashar: [C: 031] contint: Only install java 7 on trusty and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [16:16:20] ah nice we have the same info duplicated [16:17:08] ottomata: do we need hieradata/role/common/analytics/hadoop/worker.yaml since we have hieradata/role/common/analytics_cluster/hadoop/worker.yaml ?? [16:18:09] paravoid: following up, this script is run both by puppet, but also intended to be run by users operationally - in that case, user PATH could affect the script running [16:18:50] I'm in meeting now :) [16:18:59] but why would users run this operationally? [16:18:59] mostly just seems cleaner to me, and in coherence with every other script to fully qualify paths. if we have a standard for this sorta thing, i'm happy to follow it :) [16:19:00] okay :) [16:20:24] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3319296 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [16:20:32] (03PS6) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [16:22:16] (03PS7) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [16:22:44] RECOVERY - puppet last run on graphite1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:24:24] ok this should work [16:24:34] need to remove the stale hieradata [16:24:51] (03PS1) 10DCausse: [cirrus] Enable crossproject search on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) [16:26:55] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/6676/" [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) (owner: 10Elukey) [16:27:48] (03CR) 10Chad: "Yeah, it's a pretty terrible script. It comes directly from upstream, we've never actually done any changes to it ourselves." 
[puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [16:28:35] (03CR) 10EBernhardson: [C: 031] [cirrus] Enable crossproject search on all wikipedias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) (owner: 10DCausse) [16:29:10] (03PS2) 10DCausse: [cirrus] Enable crossproject search on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) [16:31:21] (03PS3) 10DCausse: [cirrus] Enable crossproject search on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) [16:35:44] !log rebooted lvs1007 (kernel update) [16:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:44] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:39:06] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3319367 (10greg) Look, we get it, CI is slower than people would like. When we proposed the nodepool backend we were optimizing for clean environment and maintainabilit... [16:39:44] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [16:39:52] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3319369 (10elukey) Current status: ``` elukey@neodymium:~$ sudo cumin 'R:class = role::analytics_cluster::hadoop::worker' 'megacl... [16:40:10] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3319372 (10greg) We're still open to helping get ops/puppet in a better place than it is now with small wins until we can migrate to the new docker based system, if you... [16:41:24] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3319375 (10jcrespo) `Rebuilding`, will resolve once it is done. [16:41:28] !log rebooted lvs1007 (kernel update) [16:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:59] 10Operations, 10ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3319376 (10jcrespo) Papaul, you are the best! [16:43:04] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:43:21] (03PS1) 10Muehlenhoff: Extend account expiry date for pnorman [puppet] - 10https://gerrit.wikimedia.org/r/357417 [16:45:14] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [16:46:39] (03CR) 10Muehlenhoff: [C: 032] Extend account expiry date for pnorman [puppet] - 10https://gerrit.wikimedia.org/r/357417 (owner: 10Muehlenhoff) [16:46:58] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3319383 (10Papaul) a:05Papaul>03Gehel @ Gehel The new SSD is in place [16:49:23] (03PS1) 10Elukey: Delete unused role/common/analytics/hadoop configs [puppet] - 10https://gerrit.wikimedia.org/r/357418 [16:49:44] RECOVERY - HP RAID on elastic2020 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2 - Controller: OK - Battery/Capacitor: OK [16:50:47] gehel: is that true? ^^^^ I cannot believe it :-P [16:51:31] volans: beleive it or not, but elastic2020 might be back in the cluster tomorrow! 
(emphasis on *might*) [16:51:40] lol :D [16:53:17] (03CR) 10Tjones: [C: 031] [cirrus] Enable crossproject search on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) (owner: 10DCausse) [16:53:44] !log installing wireshark security updates on trusty (jessie already fixed) [16:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:40] 10Operations, 10Commons, 10Wikimedia-Site-requests, 10media-storage, 10Patch-For-Review: Server side upload for Yann - https://phabricator.wikimedia.org/T166806#3319419 (10Yann) Other small files uploaded OK. Thanks to @Dereckson for processing this. [16:58:27] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3319424 (10RobH) [16:58:57] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3219.30 Read Requests/Sec=2155.30 Write Requests/Sec=18.10 KBytes Read/Sec=35650.80 KBytes_Written/Sec=92.00 [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1700). [17:00:16] Nothing for ORES today [17:02:08] 10Operations, 10ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3319450 (10Papaul) @jcrespo thanks [17:04:42] paravoid: mostly for testing, when a rule is changed etc - they are just one offs - i'm not married to the idea of using fully qualified paths, but made sense to be consistent. [17:05:37] PROBLEM - swift-account-auditor on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:37] PROBLEM - swift-container-server on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:37] PROBLEM - swift-object-auditor on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:37] PROBLEM - swift-account-server on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:37] PROBLEM - very high load average likely xfs on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:47] PROBLEM - MD RAID on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:47] PROBLEM - swift-object-replicator on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:47] PROBLEM - Disk space on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:47] PROBLEM - swift-container-auditor on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:48] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:48] PROBLEM - swift-container-updater on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:57] PROBLEM - swift-account-replicator on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:58] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=20.10 Read Requests/Sec=7.50 Write Requests/Sec=11.80 KBytes Read/Sec=30.40 KBytes_Written/Sec=77.20 [17:05:58] PROBLEM - configured eth on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:59] that's me, downtime expired [17:06:07] fixed [17:06:57] RECOVERY - configured eth on ms-be2016 is OK: OK - interfaces up [17:07:37] RECOVERY - swift-account-auditor on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:07:37] RECOVERY - swift-container-server on 
ms-be2016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:07:37] RECOVERY - very high load average likely xfs on ms-be2016 is OK: OK - load average: 31.04, 29.33, 19.21 [17:07:37] RECOVERY - swift-object-auditor on ms-be2016 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:07:37] RECOVERY - swift-account-server on ms-be2016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [17:07:47] RECOVERY - MD RAID on ms-be2016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [17:07:47] RECOVERY - swift-object-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:07:47] RECOVERY - Disk space on ms-be2016 is OK: DISK OK [17:07:48] RECOVERY - swift-container-auditor on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:07:48] RECOVERY - swift-container-updater on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:07:57] RECOVERY - swift-account-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:09:10] (03CR) 10Elukey: "10 NO-OPs: https://puppet-compiler.wmflabs.org/6678/" [puppet] - 10https://gerrit.wikimedia.org/r/357418 (owner: 10Elukey) [17:11:47] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational [17:15:55] (03PS1) 10Filippo Giunchedi: aptrepo: add hp-mcp-stretch [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) [17:16:43] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3319490 (10RobH) [17:16:49] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2001 - https://phabricator.wikimedia.org/T167160#3319508 (10RobH) [17:19:30] mobrovac: _joe_ jrbranaa blubber workboard created, moved 2 tasks that looked related, mess up as you all see fit: https://phabricator.wikimedia.org/project/view/2812/ https://phabricator.wikimedia.org/source/blubber/ [17:25:07] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 14 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [17:25:59] (03PS1) 10Chad: Group0 to wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357424 [17:26:11] (03CR) 10Chad: [C: 04-2] "For later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357424 (owner: 10Chad) [17:29:01] (03PS2) 10Filippo Giunchedi: aptrepo: add hp-mcp-stretch and thirdparty/hwraid [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) [17:29:03] (03PS1) 10Jdlrobson: Disable page previews on wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357425 (https://phabricator.wikimedia.org/T166894) [17:30:18] tgr: HEY.. i cant login to wikitech anymore - phone broken and no access to google authenticator [17:30:28] can you reset me again? 
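Back on the RAID-policy thread (the analytics BBU swaps and the profile::base::check_raid_policy change above): the truncated cumin one-liner is checking roughly the following on each Hadoop worker. A sketch only; the stock LSI MegaCLI binary name and flags are used here and may differ from the wrapper the Icinga check runs.
```
# Current vs default write policy on every logical drive (WriteBack is expected once the BBU is healthy)
sudo megacli -LDInfo -LAll -aAll | grep -i 'cache policy'
# Battery/BBU state; a failed or charging BBU is what forces controllers back to WriteThrough
sudo megacli -AdpBbuCmd -GetBbuStatus -aAll | grep -iE 'battery|charg|state'
```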
[17:31:06] !log demon@tin Started scap: testwiki to wmf.3, prepping l10n cache [17:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:17] RECOVERY - configured eth on labtestvirt2003 is OK: OK - interfaces up [17:32:40] ^ papaul result of your fix thank you [17:33:36] chasemp: no problem [17:33:51] (03PS1) 10Jdlrobson: Update ContentNamespaces for Commons Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357426 (https://phabricator.wikimedia.org/T167077) [17:48:41] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:48:46] 10Operations, 10Ops-Access-Requests: Request access to analytics-privatedata-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T167116#3318114 (10RobH) I can confirm that @GoranSMilovanovic has signed an NDA with WMF Legal (I checked against the 2016/17 NDA housekeeping: Volunteer accounts with S... [17:53:53] ^^ elastic2020 downtime seems to have expired, I'm adding some downtime, waiting for the reimage... [17:55:05] (03PS1) 10Andrew Bogott: diskspace.py: Add one more special-case flavor size. [puppet] - 10https://gerrit.wikimedia.org/r/357431 [17:56:05] (03PS2) 10Andrew Bogott: diskspace.py: Add one more special-case flavor size. [puppet] - 10https://gerrit.wikimedia.org/r/357431 [18:00:04] MaxSem and Niharika: Dear anthropoid, the time has come. Please deploy Deploy LoginNotify (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1800). [18:01:15] MaxSem: Just about done with my thing [18:03:04] !log demon@tin Finished scap: testwiki to wmf.3, prepping l10n cache (duration: 31m 58s) [18:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:49] MaxSem: All yours [18:05:00] danke, RainbowSprinkles [18:05:33] (03PS2) 10MaxSem: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357317 (https://phabricator.wikimedia.org/T165007) [18:05:45] (03CR) 10MaxSem: [C: 032] Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357317 (https://phabricator.wikimedia.org/T165007) (owner: 10MaxSem) [18:06:36] RainbowSprinkles, should I revert the livehack (testwiki to wmf.3)? [18:06:37] I'm around to test, MaxSem. [18:06:51] (03Merged) 10jenkins-bot: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357317 (https://phabricator.wikimedia.org/T165007) (owner: 10MaxSem) [18:07:05] MaxSem: Rather you not. Feel free to put a local commit there for it [18:08:11] it doesn't mind if I pull, so just leaving it there [18:10:45] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/357317/2 (duration: 00m 44s) [18:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:31] Niharika, pulled on mwdebug1002 [18:11:40] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [18:11:51] Checking. [18:12:46] MaxSem: Looks good to me. 
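The LoginNotify SWAT above follows the usual config pattern: verify the merged change on a debug host, then sync it everywhere. A hedged sketch (the host name is from the log; the scap subcommands and the commit message are assumptions based on how the !log lines read):
```
# On the debug host: pull the merged config so it can be tested via X-Wikimedia-Debug
ssh mwdebug1002.eqiad.wmnet 'scap pull'
# ...manual verification happens here...
# On the deployment host (tin): push the single changed file to the whole cluster
scap sync-file wmf-config/InitialiseSettings.php 'Enable LoginNotify on testwiki - T165007'
```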
[18:13:47] (03CR) 10jenkins-bot: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357317 (https://phabricator.wikimedia.org/T165007) (owner: 10MaxSem) [18:15:32] !log maxsem@tin Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/357317/2 (duration: 00m 44s) [18:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:20] I think downtimes got lost again [18:16:51] !log maxsem@tin Started scap: LoginNotify to testwiki - rebuild messages [18:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:54] 10Operations, 10Icinga, 10Monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3319932 (10jcrespo) I think this happened again. It didn't page because now I disable alerts every time I reimage a host, but page spam will c... [18:20:30] (03Abandoned) 10Chad: Group0 to wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357424 (owner: 10Chad) [18:28:47] MaxSem: Done soon? [18:29:08] 18:24:11 Updating LocalisationCache for 1.30.0-wmf.3 using 10 thread(s) [18:29:15] Gdi. [18:29:26] I fucked up. Should've done wmf.4 anyway instead of wmf.3 [18:29:38] :O [18:29:44] So now you're rebuilding a useless l10n cache [18:30:14] it's probably nearly done [18:34:34] what wmf are we on are we back on schedule or we still behind? [18:36:13] wmf.4 will go out this week [18:36:21] explained here: https://phabricator.wikimedia.org/T165957#3309601 [18:36:25] Also: https://tools.wmflabs.org/versions/ [18:36:35] (will always tell you what version is deployed where) [18:38:26] RainbowSprinkles: that wont load properly for me today... its my end i know so ya [18:40:34] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3320042 (10jcrespo) 05Open>03Resolved [18:50:27] * RainbowSprinkles twiddles thumbs [18:55:11] !log maxsem@tin Finished scap: LoginNotify to testwiki - rebuild messages (duration: 38m 19s) [18:55:19] Niharika, ^ [18:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:24] 38m is way too slow [18:55:31] Curious since I had just scapped prior [18:55:43] Nice! Thanks MaxSem! [18:56:00] Niharika, works ok? [18:56:13] (03CR) 10Andrew Bogott: [C: 032] diskspace.py: Add one more special-case flavor size. [puppet] - 10https://gerrit.wikimedia.org/r/357431 (owner: 10Andrew Bogott) [18:59:32] MaxSem: Seems so. [18:59:38] woot [19:00:00] lunch [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1900). [19:01:54] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: testwiki back to wmf.2 [19:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:01] (03PS1) 10Chad: group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357442 [19:08:53] !log demon@tin Synchronized README: No-op, just forcing co-master sync (duration: 01m 27s) [19:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:15] !log demon@tin Started scap: testwiki to wmf.4 + prepping l10n. 
again [19:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:49] 10Operations, 10ops-codfw, 10media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756#2511148 (10jcrespo) ``` $ cumin 'db20[33-70].*' 'hpssacli controller slot=0 show | grep -i firmware' 38 hosts will be targeted: db[2033-2070].codfw.wmnet Confirm to conti... [19:13:55] 10Operations, 10Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3318229 (10jcrespo) It could be related to T141756#3320207 [19:17:40] (03CR) 10Ottomata: [C: 031] Delete unused role/common/analytics/hadoop configs [puppet] - 10https://gerrit.wikimedia.org/r/357418 (owner: 10Elukey) [19:20:40] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [19:23:47] !log demon@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [19:23:47] !log demon@tin scap failed: RuntimeError scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) (duration: 13m 32s) [19:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:07] Crap. [19:24:48] !log demon@tin Started scap: testwiki to wmf.4 + prepping l10n. again (x2) [19:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:11] (03PS3) 10Smalyshev: Enable archive indexing on delete for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357236 (https://phabricator.wikimedia.org/T162302) [19:34:50] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [19:35:00] PROBLEM - Nginx local proxy to apache on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:15] RainbowSprinkles ^^ mutante fixed it :) [19:36:15] !log cobalt - removed systemd unit file (that has issues with ulimit and isn't used yet) - ran "systemctl reset-failed" which cleared the "systemctl status" which made the Icinga check recover [19:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:49] RainbowSprinkles: yep, "systemctl reset-failed" is a thing, #systemd told me [19:36:50] RECOVERY - Nginx local proxy to apache on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.791 second response time [19:37:19] so i removed the unit file and did that, and cleared Icinga without any direct action on gerrit [19:37:25] * RainbowSprinkles sighs [19:38:13] The systemd file will be readded when we re do the upgrade (but with the fix :)) so we will be able to try starting gerrit with systemctl again. [19:42:48] 10Operations, 10Traffic, 10Interactive-Sprint, 10Maps (Kartographer), 10Regression: Map tiles load way slower than before - https://phabricator.wikimedia.org/T167046#3320381 (10debt) p:05Triage>03Normal [19:44:18] 10Operations, 10Interactive-Sprint, 10Maps (Tilerator): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#3320388 (10debt) Moving to prioritized as it's on our list of things that do need doing. 
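What mutante describes for cobalt below generalises to any host where the "Check systemd state" alert reports "degraded"; the unit name here is illustrative, not the actual gerrit unit that was removed:
```
systemctl --failed                       # list the unit(s) holding the system in "degraded" state
systemctl status example.service         # inspect the failed unit
systemctl reset-failed example.service   # clear the failed state for that one unit...
systemctl reset-failed                   # ...or for every failed unit at once
systemctl is-system-running              # should report "running" again, which lets the Icinga check recover
```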
[19:45:07] 10Operations, 10Discovery, 10Interactive-Sprint, 10Maps (Maps-data): Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#3320393 (10debt) Moving to prioritized as it's on our list of things that do need doing. [19:45:13] !log demon@tin Finished scap: testwiki to wmf.4 + prepping l10n. again (x2) (duration: 20m 25s) [19:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:07] (03CR) 10Chad: [C: 032] group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357442 (owner: 10Chad) [19:54:48] jouncebot: now [19:54:48] For the next 1 hour(s) and 5 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1900) [19:55:44] (03Merged) 10jenkins-bot: group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357442 (owner: 10Chad) [19:57:58] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.4 [19:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:40] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3320503 (10Papaul) [20:12:00] (03CR) 10jenkins-bot: group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357442 (owner: 10Chad) [20:13:53] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2001 - https://phabricator.wikimedia.org/T167160#3320517 (10Papaul) @Robh @chasemp we have already a node with the name labtestneutron2001 in row B rack B8 can we make this labtestneutron2002? [20:16:04] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2001 - https://phabricator.wikimedia.org/T167160#3320521 (10chasemp) @papaul yes, thank you [20:16:07] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2001 - https://phabricator.wikimedia.org/T167160#3320522 (10RobH) @Papaul: good catch! Yes, lets just call this new host labtestneutron2002. [20:16:16] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3320523 (10RobH) [20:21:55] !log gerrit: Down for just a moment, finally doing point release on cobalt [20:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:06] 10Operations, 10MediaWiki-General-or-Unknown, 10Security-Team, 10Traffic, and 2 others: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3320680 (10Jdforrester-WMF) Mass-moving all items tagged for MediaWiki 1.30.0-wmf.3, as that was never released; ins... [20:45:30] (03PS1) 10Jdrewniak: Updating portals stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357490 (https://phabricator.wikimedia.org/T128546) [20:45:32] (03PS8) 10Dzahn: contint: Only install java 7 on trusty and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [20:51:24] (03CR) 10Hashar: [C: 04-1] "I am surprised by how fast the processing is done on my machine. 
The additional run is barely noticeable on my machine :]" (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: 10Volans) [20:53:06] (03CR) 10Dzahn: "thanks for clarifying that it is indeed intended to install both versions at the same time" [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [20:53:40] (03CR) 10Dzahn: [C: 032] contint: Only install java 7 on trusty and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [20:53:48] thanks mutante ^^ :) [20:53:59] 10Operations, 10Labs, 10cloud-services-team (Kanban): tools-k8s-master-01 has two floating IPs - https://phabricator.wikimedia.org/T164123#3320890 (10bd808) [20:54:05] 10Operations, 10Labs, 10cloud-services-team (Kanban): Initial OpenStack Neutron PoC deployment in Labtest - https://phabricator.wikimedia.org/T153099#3320892 (10bd808) [20:55:12] mutante: paladox: danke/thanks! [20:55:25] :) your welcome [20:56:01] 10Operations, 10Labs, 10cloud-services-team (Kanban): Investigate alternative RAID strategies for labstore1001/2 - https://phabricator.wikimedia.org/T162090#3320899 (10bd808) [20:56:03] 10Operations, 10Labs, 10cloud-services-team (Kanban): Undo special tools-home and tools-project share definitions for NFS - https://phabricator.wikimedia.org/T161834#3320900 (10bd808) [20:56:10] 10Operations, 10Labs, 10cloud-services-team (Kanban): labstore systemd state Icinga alarms - https://phabricator.wikimedia.org/T151322#3320902 (10bd808) [20:57:11] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3320910 (10herron) a:03herron [20:57:57] de rien [20:58:11] submitted it now (the bot never says that part) [20:59:56] (03PS2) 10Dzahn: contint: remove HHVM from Trusty permanent instances [puppet] - 10https://gerrit.wikimedia.org/r/357373 (owner: 10Hashar) [21:00:41] (03CR) 10Dzahn: [C: 032] "already cherry-picked" [puppet] - 10https://gerrit.wikimedia.org/r/357373 (owner: 10Hashar) [21:01:11] RainbowSprinkles: Looks like the Prefs page on Test Wikipedia is messed up (maybe related to train deployment): https://test.wikipedia.org/wiki/Special:Preferences [21:01:41] I saw someone complaining about this last week but couldn't repro [21:01:42] Hmmm [21:02:28] I think there's some JS not loading, but I don't see any JS errors in the console [21:03:01] Should I file a bug? [21:03:18] mutante: ready to submit :) [21:05:03] kaldari: Yeah file a bug... [21:05:18] I'd say a whole lot of JS isn't loading [21:05:32] The logo changes too from the cool version [21:05:40] hashar: submitted :) [21:05:46] \O/ [21:06:33] one more! [21:06:35] i see [21:06:46] (03PS14) 10Dzahn: contint: skip hhvm experimental pin on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: 10Paladox) [21:07:30] (03CR) 10Dzahn: [C: 032] "per comments above, also already cherry-picked" [puppet] - 10https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: 10Paladox) [21:07:43] thanks :) [21:09:50] paladox: trying to remember about the "nocanon" fix [21:10:12] "https://wiki.jenkins-ci.org/display/JENKINS/Running+Jenkins+behind+Apache states we should use nocanon in Apache ProxyPass" ah, yea [21:10:39] "Yep. We do that for gerrit too." 
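For context on change 357197 ("Tox: find and check Python files without extension") being reviewed above: one way to locate extension-less Python scripts is to look at their shebangs. A rough sketch, not the actual tox implementation:
```
# Files with no dot in their name whose first line is a python shebang, fed to flake8
find . -type f ! -name '*.*' -not -path './.git/*' \
  -exec sh -c 'head -n1 "$1" | grep -q "^#!.*python" && printf "%s\n" "$1"' sh {} \; \
  | xargs -r flake8
```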
[21:12:20] (03PS7) 10Dzahn: Jenkins: Add noncanon to jenkins proxy site [puppet] - 10https://gerrit.wikimedia.org/r/351391 (owner: 10Paladox) [21:12:40] Yep [21:13:11] (03CR) 10Dzahn: "jenkins docs: "Both the nocanon option to ProxyPass, and AllowEncodedSlashes NoDecode, are required for certain Jenkins features to work."" [puppet] - 10https://gerrit.wikimedia.org/r/351391 (owner: 10Paladox) [21:13:34] jynus: you there? [21:13:49] RainbowSprinkles: Created bug: https://phabricator.wikimedia.org/T167216 No idea who to subscribe to it though. [21:14:01] paladox: also quote from Apache mod_proxy docs .. added [21:14:23] (03CR) 10Dzahn: [C: 032] Jenkins: Add noncanon to jenkins proxy site [puppet] - 10https://gerrit.wikimedia.org/r/351391 (owner: 10Paladox) [21:14:50] kaldari: We could just start subscribing everyone at random until someone fixes it ;-) [21:15:03] good idea! [21:15:10] tgr: ping [21:15:13] * RainbowSprinkles writes a greasemonkey script called "subscribe-all-the-people" [21:15:26] (03PS1) 10Framawiki: Lift IP throttle for Wikipedia Editathon (June 16th 2017) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357510 (https://phabricator.wikimedia.org/T167201) [21:15:43] mutante thanks :) [21:15:50] kaldari: It's busted on mw.org too, I'm going to roll that back to wmf.2 for now [21:15:55] it needs a subscribe bot that finds the right people. like we have it for gerrit :) [21:16:10] thanks [21:16:30] (03PS1) 10Chad: Moving mediawiki.org back to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357511 [21:16:32] that special wiki page would just be "keywords -> people" [21:16:46] (03CR) 10Chad: [C: 032] Moving mediawiki.org back to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357511 (owner: 10Chad) [21:17:28] mutante: Oh, I wasn't looking for the right people. I was just going to subscribe people at random until someone fixes it :p [21:17:59] (03Merged) 10jenkins-bot: Moving mediawiki.org back to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357511 (owner: 10Chad) [21:18:09] (03CR) 10jenkins-bot: Moving mediawiki.org back to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357511 (owner: 10Chad) [21:18:29] RainbowSprinkles: heheee, yea, just need to disable notifications for "unsubscribe" action [21:19:24] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: unbreak mw.org pref page [21:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:38] (03PS2) 10Dzahn: [Planet Wikimedia] Add blog.wikimedia.gr to Greek Planet [puppet] - 10https://gerrit.wikimedia.org/r/357254 (owner: 10Nemo bis) [21:22:00] (03PS1) 10Andrew Bogott: designate.conf: Update the keystone_authtoken section [puppet] - 10https://gerrit.wikimedia.org/r/357512 [21:22:24] kaldari: Definitely isolated to wmf.2 -> wmf.4 jump. Rolling mw.org back fixed it [21:22:50] (03CR) 10Dzahn: "https works here, let's use it wherever possible, amending" [puppet] - 10https://gerrit.wikimedia.org/r/357254 (owner: 10Nemo bis) [21:23:50] RainbowSprinkles: Beta Cluster has the same issue [21:23:52] https://simple.wikipedia.beta.wmflabs.org/wiki/Special:Preferences [21:24:30] So nobody's fixed it yet in master, ok. [21:25:02] Guess I'll add it to the deployment blockers [21:25:20] Yeah please. I'll look at this some more in a bit, gotta run to the post office [21:29:31] twentyafterfour: are you about? 
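For reference, the two Apache directives the nocanon change above is about (as quoted from the Jenkins docs in the log); the proxied path and backend port are placeholders, and the snippet is written to a scratch file purely for illustration:
```
cat <<'EOF' > /tmp/jenkins-proxy-snippet.conf
# Keep encoded slashes intact instead of rejecting or decoding them
AllowEncodedSlashes NoDecode
# "nocanon" stops mod_proxy from re-canonicalising the URL before passing it to Jenkins
ProxyPass        /ci http://127.0.0.1:8080/ci nocanon
ProxyPassReverse /ci http://127.0.0.1:8080/ci
EOF
```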
[21:30:42] (03PS3) 10Dzahn: [Planet Wikimedia] Add blog.wikimedia.gr to Greek Planet [puppet] - 10https://gerrit.wikimedia.org/r/357254 (owner: 10Nemo bis) [21:35:50] (03CR) 10Dzahn: [C: 032] [Planet Wikimedia] Add blog.wikimedia.gr to Greek Planet [puppet] - 10https://gerrit.wikimedia.org/r/357254 (owner: 10Nemo bis) [21:38:30] TabbyCat: o/ [21:41:53] !log contint1001 - graceful'ed Apache to deploy gerrit:351391 [21:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:20] tgr: can you check https://phabricator.wikimedia.org/T167219 ? [21:46:10] (03PS5) 10Dzahn: flake8: upgrade to 3.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/356236 (owner: 10BryanDavis) [21:47:26] TabbyCat: do you have that log open? [21:47:53] tgr: nope :( [21:49:17] tgr: I have IP/CIDR if it helps [21:50:53] thx, found it [21:51:42] okay, I'll be around for a couple of minutes, if you need something let me know tgr and I'll see if I can help [21:53:34] !log gerrit: restarting to test a config tweak [21:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:03] 10Operations, 10HHVM: Upload hhvm to stretch apt repo in apt.wikimedia.org - https://phabricator.wikimedia.org/T167225#3321280 (10Paladox) [21:55:31] RainbowSprinkles what config are you testing? :) [21:56:35] doesn't matter, didn't work [21:56:50] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:57:00] !log gerrit: restarting last time, didn't work like I wanted [21:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:05] (03CR) 10Volans: "> I am surprised by how fast the processing is done on my machine." (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: 10Volans) [22:00:23] (03PS4) 10Volans: Tox: find and check Python files without extension [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) [22:00:52] RainbowSprinkles we can close https://phabricator.wikimedia.org/T158946 as resolved now? [22:01:45] thanks :) [22:04:40] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2038941 [22:05:00] 10Operations, 10Gerrit, 10Beta-Cluster-reproducible, 10Patch-For-Review, and 2 others: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#3321381 (10demon) 05Open>03Resolved This shouldn't actually be a problem anymore. [22:09:46] (03CR) 10Bearloga: Add Shiny Server module and Discovery Dashboards role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: 10Bearloga) [22:10:59] (03PS1) 10Eevans: WIP: throwing things against Puppet Compiler to see what sticks [puppet] - 10https://gerrit.wikimedia.org/r/357515 (https://phabricator.wikimedia.org/T167222) [22:13:52] (03CR) 10Dzahn: [C: 032] flake8: upgrade to 3.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/356236 (owner: 10BryanDavis) [22:15:57] 10Operations, 10Gerrit, 10Beta-Cluster-reproducible, 10Release-Engineering-Team (Kanban), 10Upstream: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#3321443 (10Paladox) [22:17:44] (03CR) 10Chad: "Ignore this comment, posting for an example task." 
(031 comment) [software/gerrit] - 10https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [22:18:01] is it not possible to use puppet compiler for deployment-prep in labs? [22:20:08] (03CR) 10Paladox: [C: 031] Add core + core plugins @ 2.13.8 [software/gerrit] - 10https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [22:20:16] (03CR) 10Paladox: [C: 031] Configuring git-fat to work with Archiva [software/gerrit] - 10https://gerrit.wikimedia.org/r/356482 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [22:20:19] (03CR) 10Paladox: [C: 031] Adding scap3 config [software/gerrit] - 10https://gerrit.wikimedia.org/r/356484 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [22:20:35] I reviewed those a while ago ^^ i am only adding +1 now :) [22:23:50] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [22:25:21] mutante: ping? [22:26:02] (03CR) 10Dzahn: "feel free to re-add me if any changes" [puppet] - 10https://gerrit.wikimedia.org/r/145018 (owner: 10ArielGlenn) [22:28:51] (03PS3) 10Bearloga: Add Shiny Server module and Discovery Dashboards role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) [22:31:12] 10Operations, 10Labs, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Set up external DNS record for wikitech-static - https://phabricator.wikimedia.org/T164290#3321486 (10RobH) [22:32:18] urandom: not that i'd know of. it looks at site.pp to find the nodes and there is none in labs [22:32:37] could be that i just dont know how though [22:32:58] saw https://wikitech.wikimedia.org/wiki/Puppet_migration#Puppet_Catalogs_compiler that is about setting it up in vagrant [22:33:05] mutante: oh, yeah. i was going to ask how familiar you are with the cassandra puppetization since _joe_ refactored it [22:33:17] ah, i was trying to answer the compiler question [22:33:34] i am not familiar with the cassandra puppetization in particular [22:33:36] yeah, i think i came to the same conclusion :( [22:33:55] did you have a particular issue ? [22:34:08] yeah, in deployment prep we have two cassandra clusters that are crossed [22:34:11] they've... merged [22:34:25] because one of them got some seeds mixed in from the other [22:34:42] which i'm guessing is the result of inheritance [22:35:03] but everything has sort of changed here [22:35:30] i _think_ he would try to avoid inheritance [22:35:31] the PC question was because i was going to iterate on some educated guesses :) [22:35:38] what are the names of the crossed ones? [22:36:25] 10Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Remove deprecated features from book creator UI - https://phabricator.wikimedia.org/T150917#3321498 (10Jdlrobson) [22:36:27] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Reading-Web-Backlog, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3321497 (10Jdlrobson) [22:36:32] it's the restbase cluster (consisting of restbase01 and restbase02), and the aqs one (consisting of aqs0[1-3]) [22:37:29] mutante: i was looking at hieradata/labs/deployment-prep/common.yaml, and profile::cassandra::instances lists the two restbase nodes [22:37:47] ok, and what are the names of the roles that they are using [22:38:03] hrmm, restbase and aqs, i think [22:38:27] hmm. 
then it would be all different from production [22:39:35] so the profile should be used inside a role [22:39:46] and the role should be on the instance [22:40:33] and what is in hieradata/labs/deployment-prep/common.yaml would probably be applied to the whole deployment-prep project then [22:40:46] that would mean it's not based on the role [22:41:22] so that could explain why you see things from there on all the instances [22:42:05] in prod, if something is in hieradata/role/common/ it gets applied on all nodes using that role [22:42:08] yeah, the restbase nodes don't have the aqs nodes in their seeds list, but the aqs nodes have the other aqs nodes and the restbase nodes [22:42:16] unfortunately hieradata/labs follows a different approach [22:42:35] yeah, that was confusing me [22:43:00] i agree [22:43:14] the best fix would be to make it more similar i think [22:43:35] hiera lookup based on role instead of project, and then apply roles on individual instances [22:43:48] (03Abandoned) 10Eevans: WIP: throwing things against Puppet Compiler to see what sticks [puppet] - 10https://gerrit.wikimedia.org/r/357515 (https://phabricator.wikimedia.org/T167222) (owner: 10Eevans) [22:44:07] in horizon you can also do either or, apply a puppet role by instance, by project or even by prefix of the hostname [22:45:02] but that feels like a bigger thing to restructure the whole deployment-prep setup [22:46:34] 6 [22:48:32] (03CR) 10Faidon Liambotis: [C: 04-1] aptrepo: add hp-mcp-stretch and thirdparty/hwraid (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [22:53:40] (03PS4) 10Paladox: Gerrit: Reveal the author in the title of the email [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) [22:54:32] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3321579 (10Dzahn) [22:55:20] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3301040 (10Dzahn) 05Open>03Resolved great! thank you. i have removed the network access part from the onboarding. that means all subtasks are resolved and closing this. [22:58:09] 10Operations, 10HHVM: Upload hhvm to stretch apt repo in apt.wikimedia.org - https://phabricator.wikimedia.org/T167225#3321589 (10Dzahn) this would be like T125821 was for jessie [22:58:46] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3321592 (10faidon) Well, first of all, right before I filed this task, Antoine said on IRC: > containers for CI would be for later. The priority has been set toward sta... [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T2300). [23:00:04] Jdlrobson, Smalyshev, and matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
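To illustrate the hiera layout difference described above: a key defined in hieradata/labs/deployment-prep/common.yaml is resolved by every instance in the deployment-prep project regardless of role, while in production a key under hieradata/role/common/ is only resolved by nodes applying that role. That project-wide scoping is how aqs instances can end up picking up restbase Cassandra seeds. The sketch below follows that layout, but the key name and instance names are simplified placeholders, not the actual deployment-prep values.

```
# hieradata/labs/deployment-prep/common.yaml
# Project-scoped: every deployment-prep instance resolves this key,
# including the aqs hosts, so their Cassandra config can inherit
# restbase seeds by accident.
profile::cassandra::seeds:
  - deployment-restbase01.deployment-prep.eqiad.wmflabs
  - deployment-restbase02.deployment-prep.eqiad.wmflabs

# hieradata/role/common/aqs.yaml  (production-style, role-scoped)
# Only nodes that include the aqs role resolve this key, keeping the
# two clusters separate.
profile::cassandra::seeds:
  - aqs-node-a.example.eqiad.wmflabs
  - aqs-node-b.example.eqiad.wmflabs
```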
[23:00:17] \o [23:00:23] Present [23:03:10] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:07:31] I can SWAT [23:08:10] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 15 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:08:17] (03PS2) 10Thcipriani: Update ContentNamespaces for Commons Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357426 (https://phabricator.wikimedia.org/T167077) (owner: 10Jdlrobson) [23:08:25] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357426 (https://phabricator.wikimedia.org/T167077) (owner: 10Jdlrobson) [23:09:39] (03Merged) 10jenkins-bot: Update ContentNamespaces for Commons Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357426 (https://phabricator.wikimedia.org/T167077) (owner: 10Jdlrobson) [23:09:48] (03CR) 10jenkins-bot: Update ContentNamespaces for Commons Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357426 (https://phabricator.wikimedia.org/T167077) (owner: 10Jdlrobson) [23:10:08] (03CR) 10Dzahn: "alright, got back to this one and had to remember myself what we said here. so if the package does provide the traditional sysvinit init s" [puppet] - 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [23:10:15] Reedy: around? [23:10:34] jdlrobson: contentnamespaces patch is live on mwdebug1002, check please [23:10:39] on it! [23:10:59] it works thcipriani [23:11:02] sync away [23:11:06] * thcipriani does [23:12:41] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:357426|Update ContentNamespaces for Commons Wiki]] T167077 (duration: 00m 46s) [23:12:47] ^ jdlrobson live now [23:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:53] T167077: Use wgContentNamespaces instead of $wgMFContentNamespace - https://phabricator.wikimedia.org/T167077 [23:13:01] (03PS2) 10Thcipriani: Disable page previews on wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357425 (https://phabricator.wikimedia.org/T166894) (owner: 10Jdlrobson) [23:13:03] looks good! 
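For reference on the ContentNamespaces patch just synced: per-wiki overrides in wmf-config/InitialiseSettings.php map a setting name to an array keyed by wiki database name, and a '+' prefix on the wiki name merges that entry with the default instead of replacing it. The fragment below shows only the shape of such an override; the namespace IDs are placeholders, not the values gerrit:357426 actually deployed for commonswiki.

```
// Excerpt shape from wmf-config/InitialiseSettings.php (illustrative values only)
'wgContentNamespaces' => [
	'default' => [ NS_MAIN ],
	// '+' means: merge with 'default' rather than replace it.
	'+commonswiki' => [ NS_FILE, 100 ], // 100 is a hypothetical extra namespace
],
```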
[23:13:07] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357425 (https://phabricator.wikimedia.org/T166894) (owner: 10Jdlrobson) [23:14:24] (03Merged) 10jenkins-bot: Disable page previews on wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357425 (https://phabricator.wikimedia.org/T166894) (owner: 10Jdlrobson) [23:15:13] jdlrobson: ^ is live on mwdebug1002, check please [23:15:19] on it [23:15:27] SMalyshev: ping for SWAT [23:15:41] (03CR) 10jenkins-bot: Disable page previews on wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357425 (https://phabricator.wikimedia.org/T166894) (owner: 10Jdlrobson) [23:15:42] thcipriani: that's also good [23:15:48] * thcipriani syncs [23:17:17] (03CR) 10Dzahn: [C: 04-1] "i'll turn -2 into -1 for now, it's a good point but we gotta make sure there is no difference between the 2 files" [puppet] - 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [23:17:17] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:357425|Disable page previews on wikispecies]] T166894 (duration: 00m 44s) [23:17:24] ^ jdlrobson live now [23:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:25] T166894: Disable page previews on wikispecies - https://phabricator.wikimedia.org/T166894 [23:17:52] thcipriani: all good! [23:18:00] cool, thanks for checking :) [23:18:25] blerg, should have started merging these flow patches sooner :\ [23:24:22] https://test2.wikipedia.org/wiki/Main_Page - 500 - /php-1.30.0-wmf.4/includes/parser/Parser.php: Tag hook for noexternallanglinks is not callable [23:25:32] Wikidata breaking things? [23:26:42] thcipriani: ^ [23:27:13] oh good. [23:29:01] Filed a task just now [23:29:18] subtask of wmf.4 blockers [23:29:28] https://phabricator.wikimedia.org/T167238 [23:29:31] Krinkle: thanks, RainbowSprinkles ^ FYI [23:30:23] I think we're just on testwikis with wmf.4 so I may leave it for folks to investigate. MediaWiki is on wmf.2 [23:30:36] Krinkle: I already rolled mw.org back to wmf.2 because of T167216 [23:30:36] T167216: Preferences page messed up on Test Wikipedia (1.30.0-wmf.4) - https://phabricator.wikimedia.org/T167216 [23:30:45] (so only running on test(2) and friends) [23:30:54] OK [23:33:22] matt_flaschen: hrrrm, jenkins didn't like your patches for SWAT for some reason :\ [23:34:28] oh, composer + https://status.github.com/ [23:35:08] thcipriani, yeah, just checked, both are showing that. [23:35:24] 10Operations, 10DNS, 10Traffic: Redirect status.wikipedia.org to status.wikimedia.org - https://phabricator.wikimedia.org/T167239#3321697 (10Ladsgroup) [23:37:35] thcipriani, I re-did the gate, but I'm not crossing my fingers. This is a bad bug, but it's not a new bug, so not sure. I lean towards waiting until it can gate normally. ^ RoanKattouw [23:39:03] thcipriani, it says it's recovering now: https://status.github.com/ [23:39:05] 19:38 EDT [23:39:06] Our systems are recovering from the interruption of one of our core data services. [23:39:48] 16:24:21 https://test2.wikipedia.org/wiki/Main_Page - 500 - /php-1.30.0-wmf.4/includes/parser/Parser.php: Tag hook for noexternallanglinks is not callable [23:39:48] 16:25:31 Wikidata breaking things? [23:39:56] matt_flaschen: Is that related to your/our Wikidata change for RCF? 
---^^ [23:40:17] I think ours broke noexternallanglinks, not to say that it must be our change that broke it now, but it is suspicious [23:40:26] Maybe the Wikidata people refactored it and broke it, but who knows [23:43:01] RoanKattouw, I was also suspicious and wondering about that. Our patch was almost 3 months ago so I put it back down (thinking it probably wasn't broken that long), but I'll check for sure (it's probably not a widely used magic word, and maybe someone just recently added it to test2) [23:43:13] I can probably track it down now regardless. [23:46:51] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [23:52:20] matt_flaschen: changes for flow for wmf.2 and wmf.4 live on mwdebug1002, check please [23:55:03] !log gerrit: force stopping for a second to reindex accounts [23:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:10] thcipriani, works, good to go for everywhere. [23:56:16] matt_flaschen: ok, going live [23:56:17] gerrit acts like it's down [23:56:22] Posted a test topic at https://gom.wikipedia.org/wiki/%E0%A4%B5%E0%A4%BE%E0%A4%AA%E0%A4%B0%E0%A4%AA%E0%A5%80_%E0%A4%9A%E0%A4%B0%E0%A5%8D%E0%A4%9A%E0%A4%BE:STACEY_MESQUITA then hid it. [23:56:23] !log gerrit: back from reindexing [23:56:26] nope [23:56:29] false alarm [23:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:32] Amir1: See SAL, on purpose for 2 seconds :) [23:56:33] Bad timing [23:56:52] It seems I always am [23:57:09] :P [23:58:23] !log thcipriani@tin Synchronized php-1.30.0-wmf.4/extensions/Flow/includes/Content/BoardContentHandler.php: SWAT: [[gerrit:357501|Revert "Throw when unserializing invalid Flow workflow metadata JSON"]] T166100 T156813 (duration: 00m 45s) [23:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:32] T166100: MWContentSerializationException: Failed to decode blob. It should be JSON representing valid Flow metadata. - https://phabricator.wikimedia.org/T166100 [23:58:32] T156813: MWContentSerializationException in Konkani Wikipedia (gomwiki) - https://phabricator.wikimedia.org/T156813 [23:59:10] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [23:59:43] !log thcipriani@tin Synchronized php-1.30.0-wmf.2/extensions/Flow/includes/Content/BoardContentHandler.php: SWAT: [[gerrit:357500|Revert "Throw when unserializing invalid Flow workflow metadata JSON"]] T166100 T156813 (duration: 00m 43s) [23:59:49] ^ matt_flaschen live everywhere [23:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
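A closing note on the noexternallanglinks 500 filed above as T167238: MediaWiki's Parser.php raises "Tag hook for X is not callable" when a tag registered through Parser::setHook() points at a callback that no longer resolves, for example after a class or method is renamed in a refactor. The sketch below shows the generic registration pattern only; the handler class is invented for illustration and is not the actual Wikibase code.

```
<?php
// Generic parser tag hook wiring (illustrative only).
$wgHooks['ParserFirstCallInit'][] = function ( Parser $parser ) {
	// If ExampleLangLinkHandler::render stopped existing (e.g. the class
	// was renamed), pages using the tag would fail with
	// "Tag hook for noexternallanglinks is not callable".
	$parser->setHook( 'noexternallanglinks', [ 'ExampleLangLinkHandler', 'render' ] );
	return true;
};

class ExampleLangLinkHandler {
	public static function render( $input, array $args, Parser $parser, PPFrame $frame ) {
		// A real handler would record which interwiki language links to
		// suppress; this placeholder just produces no output.
		return '';
	}
}
```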