[00:30:54] 10Operations, 10Labs, 10Labs-Infrastructure, 10labs-sprint-117, 10labs-sprint-118: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#3317065 (10chasemp) This task was to make a plan for user mgmt access to bare metal as a service @dzahn to help clarify, which we have... [00:43:38] (03CR) 10Kaldari: [C: 031] "Looks good to me. Let's schedule for later this week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357317 (https://phabricator.wikimedia.org/T165007) (owner: 10MaxSem) [01:36:24] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team, 10Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3317112 (10tstarling) [01:39:39] 10Operations, 10Labs, 10Labs-Infrastructure, 10labs-sprint-117, 10labs-sprint-118: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#3317114 (10Dzahn) Got it, thank you both. Yep! [02:21:34] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 32s) [02:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:38] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jun 6 02:27:37 UTC 2017 (duration 6m 3s) [02:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:24:16] 10Operations, 10Wikimedia-General-or-Unknown, 10I18n: wikimediafoundation.org's language selector is confusing to most visitors who don't have accounts there - https://phabricator.wikimedia.org/T166782#3317206 (10whym) [03:38:36] PROBLEM - Disk space on ms-be1016 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error [03:39:06] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 62220 MB (12% inode=99%) [03:41:36] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[mountpoint-/srv/swift-storage/sdh1] [03:54:06] RECOVERY - Disk space on elastic1019 is OK: DISK OK [04:01:36] RECOVERY - Disk space on ms-be1016 is OK: DISK OK [04:08:36] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [05:08:36] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. [05:41:36] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [05:45:04] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317260 (10Marostegui) This went back to faulty again: ``` BatteryType: BBU Battery State: Unknown Battery backup charge time : 0 hours ``` Raid went back to WriteThrough: ``` Default Cache... [05:51:36] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [05:54:03] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317261 (10Marostegui) And it is back: ``` 05:51 < icinga-wm> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy Default Cache Policy: WriteBack, Read... 
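The db1016 MegaRAID alert above fires when a logical drive's cache policy falls back from WriteBack to WriteThrough, which is what the controller does while its BBU is faulty or recharging (as in the quoted T166344 comments). A minimal Python sketch of that kind of check, assuming MegaCli-style output with a "Current Cache Policy:" line; this is not the production MegaRAID plugin:

```python
#!/usr/bin/env python3
"""Minimal sketch, NOT the production MegaRAID check: flag logical drives
whose current cache policy has fallen back from WriteBack to WriteThrough
(the usual symptom of a faulty/charging BBU, as on db1016 above)."""
import subprocess
import sys


def check_cache_policy():
    # Assumption: MegaCli64 is installed and its -LDInfo output contains
    # lines like "Current Cache Policy: WriteThrough, ReadAheadNone, ...".
    out = subprocess.check_output(
        ['MegaCli64', '-LDInfo', '-Lall', '-aALL', '-NoLog'], text=True)
    bad = [line for line in out.splitlines()
           if line.strip().startswith('Current Cache Policy:')
           and 'WriteThrough' in line]
    if bad:
        print('CRITICAL: %d LD(s) must have write cache policy WriteBack, '
              'currently using: WriteThrough' % len(bad))
        return 2
    print('OK: all logical drives use WriteBack policy')
    return 0


if __name__ == '__main__':
    sys.exit(check_cache_policy())
```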
[05:56:11] !log Deploy alter table s3 on db1075 (eqiad master) - T166278 [05:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:21] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [06:13:00] (03PS8) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [06:33:44] (03PS1) 10Marostegui: db-eqiad.php: Add comment about db1089 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357341 (https://phabricator.wikimedia.org/T166935) [06:36:26] PROBLEM - Check systemd state on elastic2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:37:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add comment about db1089 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357341 (https://phabricator.wikimedia.org/T166935) (owner: 10Marostegui) [06:38:32] (03Merged) 10jenkins-bot: db-eqiad.php: Add comment about db1089 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357341 (https://phabricator.wikimedia.org/T166935) (owner: 10Marostegui) [06:38:41] (03CR) 10jenkins-bot: db-eqiad.php: Add comment about db1089 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357341 (https://phabricator.wikimedia.org/T166935) (owner: 10Marostegui) [06:40:06] 10Operations, 10ops-codfw: mw2221 stuck after reboot - https://phabricator.wikimedia.org/T165734#3317286 (10MoritzMuehlenhoff) 05Open>03Resolved Closing, that host is running and repooled for a while now. [06:40:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add comments about current status of db1089 - T166935 (duration: 00m 39s) [06:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:18] T166935: db1089: update RAID controller firwmare - https://phabricator.wikimedia.org/T166935 [06:49:27] PROBLEM - Check the NTP synchronisation status of timesyncd on elastic2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:11] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317309 (10elukey) >>! In T166141#3315357, @jcrespo wrote: > Not really, we have almost decided the goals for Q1, and they are all quite urgent and for hardware that ha... [07:10:48] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317327 (10Marostegui) >>! In T166141#3317309, @elukey wrote: >>>! In T166141#3315357, @jcrespo wrote: >> Not really, we have almost decided the goals for Q1, and they... [07:11:36] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [07:12:47] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317328 (10Marostegui) And again: ``` ˜/icinga-wm 9:11> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough ``` [07:27:49] moritzm: Guten Tag. 
Looks like hhvm-dbg wmf4+exp1 is not available on apt.wikimedia.org :D [07:28:09] that breaks puppet / apt-get install on the beta cluster instances which have hhvm-dbg installed: [07:28:10] hhvm-dbg : Depends: hhvm (= 3.18.2+dfsg-1+wmf4) but 3.18.2+dfsg-1+wmf4+exp1 is installed [07:29:46] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:47] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:47] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:30:46] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [07:30:46] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:30:46] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [07:32:14] hashar: which host is that? wmf+exp1 is intentionally not on apt.wikimedia.org, it was an experimental build to investigate the behaviour of persistant connections as used by the job runners [07:32:32] moritzm: deployment-jobrunner02.deployment-prep.eqiad.wmflabs [07:32:48] those tests are completed, so if it's still around, I'll simply downgrade to +wmf4 [07:33:00] ahh make sense. thanks [07:35:39] fixed [07:38:22] hashar_: my fault! I was testing a new hhvm version for the connect timeouts! [07:38:36] no worries :-} [07:38:47] has the experience been any helpful ? [07:39:56] PROBLEM - IPMI Temperature on labsdb1003 is CRITICAL: Sensor Type(s) Temperature Status: Critical [Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical, Processo [07:41:26] PROBLEM - puppet last run on elastic2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:43:23] (03PS1) 10Hashar: beta: profile::cassandra::allow_analytics: false [puppet] - 10https://gerrit.wikimedia.org/r/357344 [07:45:30] (03PS2) 10Hashar: beta: profile::cassandra::allow_analytics: false [puppet] - 10https://gerrit.wikimedia.org/r/357344 [07:46:46] (03CR) 10Hashar: "puppet fails on deployment-prep instances since the hiera "role" hierarchy is not looked up. 
https://gerrit.wikimedia.org/r/#/c/357344/ sh" [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [07:49:04] (03PS1) 10DCausse: Test that replica counts are within sane bounds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357345 [07:49:26] RECOVERY - Check systemd state on elastic2014 is OK: OK - running: The system is fully operational [07:50:31] (03CR) 10DCausse: [C: 031] "lgtm," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 (owner: 10Hashar) [07:53:00] !log starting upgrade to elasticsearch 5.3.2 on cirrus eqiad cluster - T163708 [07:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:09] T163708: Upgrade the production search cluster to elastic 5.3.2 - https://phabricator.wikimedia.org/T163708 [07:55:05] (03CR) 10Elukey: "Tested the script (with echo instead of restart of course) on rdb1001 and rdb1002, everything works as expected (only the latter prints re" [puppet] - 10https://gerrit.wikimedia.org/r/357193 (owner: 10Giuseppe Lavagetto) [07:56:54] hashar: re "has the exp been helpful" - not really, I didn't manage to fix the issue depicted in https://github.com/facebook/hhvm/issues/7854 but I have a better idea about what is happening. I feel that I am missing something trivial [07:57:24] (03Abandoned) 10DCausse: [wikitech] Increase weight on Tool and Nova Resource ns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354474 (https://phabricator.wikimedia.org/T165725) (owner: 10DCausse) [08:04:04] (03CR) 10Alexandros Kosiaris: [C: 032] Do not confine LLDP fact to physical/non-VMs [puppet] - 10https://gerrit.wikimedia.org/r/354108 (owner: 10Faidon Liambotis) [08:04:09] (03PS4) 10Alexandros Kosiaris: Do not confine LLDP fact to physical/non-VMs [puppet] - 10https://gerrit.wikimedia.org/r/354108 (owner: 10Faidon Liambotis) [08:04:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Do not confine LLDP fact to physical/non-VMs [puppet] - 10https://gerrit.wikimedia.org/r/354108 (owner: 10Faidon Liambotis) [08:07:14] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Elasticsearch errors about BulkShardRequest - https://phabricator.wikimedia.org/T167091#3317363 (10Gehel) [08:07:42] (03PS2) 10Volans: Tox: find and check Python files without extension [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) [08:14:27] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317379 (10elukey) Sure I am concerned too, this is why I asked if it was possible to order the hardware as soon as possible to be ready to work on it by the end of Q1 :) [08:14:51] (03PS3) 10Elukey: Correct pageview_hourly loading scheme on pivot home [puppet] - 10https://gerrit.wikimedia.org/r/357315 (owner: 10Nuria) [08:17:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:18:06] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:18:36] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [08:19:26] PROBLEM - Check the NTP synchronisation status of timesyncd on elastic2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
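The ripe-atlas-codfw check above alerts once more than 19 of the measurement's 283 probes fail to reach codfw (and recovers at 18 failed). A minimal Python sketch of that thresholding, assuming the per-probe results have already been fetched from the RIPE Atlas measurement linked in the alert; not the production plugin:

```python
#!/usr/bin/env python3
"""Minimal sketch, not the production RIPE Atlas plugin: alert when the
number of failing probes exceeds a threshold, as in the
"failed 20 probes of 283 (alerts on 19)" message above."""


def evaluate(probe_results, alert_on=19):
    # probe_results: assumed list of booleans, True = probe reached codfw.
    failed = sum(1 for ok in probe_results if not ok)
    status = 'CRITICAL' if failed > alert_on else 'OK'
    return '%s - failed %d probes of %d (alerts on %d)' % (
        status, failed, len(probe_results), alert_on)


if __name__ == '__main__':
    # toy data: 283 probes, 20 of them failing
    print(evaluate([False] * 20 + [True] * 263))
```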
[08:19:28] (03CR) 10Elukey: [C: 032] Correct pageview_hourly loading scheme on pivot home [puppet] - 10https://gerrit.wikimedia.org/r/357315 (owner: 10Nuria) [08:20:45] (03PS2) 10Alexandros Kosiaris: network: Add kubernetes pod/service IPs [puppet] - 10https://gerrit.wikimedia.org/r/341792 [08:21:26] PROBLEM - Check systemd state on elastic2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:22:12] 10Operations, 10Continuous-Integration-Config, 10Operations-Software-Development, 10Patch-For-Review: Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169#2590514 (10Joe) >>! In T144169#2836235, @fgiunchedi wrote: > After some discussion in https://gerrit.wik... [08:22:54] (03CR) 10Alexandros Kosiaris: [C: 032] network: Add kubernetes pod/service IPs [puppet] - 10https://gerrit.wikimedia.org/r/341792 (owner: 10Alexandros Kosiaris) [08:22:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:22:58] (03PS3) 10Alexandros Kosiaris: network: Add kubernetes pod/service IPs [puppet] - 10https://gerrit.wikimedia.org/r/341792 [08:23:03] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] network: Add kubernetes pod/service IPs [puppet] - 10https://gerrit.wikimedia.org/r/341792 (owner: 10Alexandros Kosiaris) [08:24:35] gehel: /var/log/elasticsearch/production-search-codfw.log depleted the root partiton on elastic2014 [08:25:03] moritzm: yep, I'm on it with dcausse. We are trying to understand what went wrong before truncating the logs... [08:25:24] ok [08:30:47] (03PS3) 10Volans: Tox: find and check Python files without extension [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) [08:33:12] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Elasticsearch errors about BulkShardRequest - https://phabricator.wikimedia.org/T167091#3317408 (10dcausse) The doc `1615532` is in the general index at [[https://ja.wikipedia.org/wiki/%E3%83%8E%E3%83%BC%E3%83%88:%E9%80%9F%E6%B0%B4%E5%A4%AA%E9... 
[08:34:26] RECOVERY - Check systemd state on elastic2014 is OK: OK - running: The system is fully operational [08:37:27] (03PS31) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) [08:39:22] !log raise log level to WARN for TransportShardBulkAction on elasticsearch cirrus - T167091 [08:39:26] RECOVERY - puppet last run on elastic2014 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [08:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:30] T167091: Elasticsearch errors about BulkShardRequest - https://phabricator.wikimedia.org/T167091 [08:39:43] (03PS32) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) [08:43:03] !log stopping db2035 and preparing for reimage [08:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:23] (03CR) 10Muehlenhoff: Add initial class for ferm rules shared by all labstore hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [08:48:53] (03CR) 10Elukey: "Added another round of refactoring to eliminate the old zookeeper_cluster_name global variable from the profile zookeeper server." [puppet] - 10https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [08:50:49] (03PS4) 10Muehlenhoff: Add initial class for ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) [08:51:44] (03PS5) 10Muehlenhoff: Add initial class for ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) [08:54:52] (03CR) 10Giuseppe Lavagetto: "I would rename the file to a profile, as it's more of a profile (that could be included in all labstore roles, if there is more than one)." [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [08:54:59] !log restarting elastic2014 to reclaim free space on deleted log file [08:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:56] RECOVERY - Disk space on elastic2014 is OK: DISK OK [09:00:05] akosiaris and hashar: Respected human, time to deploy Jobrunner service to scap3 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T0900). Please do the needful. [09:00:21] hashar: ok, I am starting the dance :-) [09:00:30] sure [09:00:38] and as far as I can tell, scap only supports a single service :/ [09:02:07] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317454 (10Marostegui) db1075 the master is done - the whole shard is completed. [09:02:24] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317455 (10Marostegui) ^ Wrong ticket [09:02:34] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3317456 (10mmodell) [09:03:14] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3215445 (10mmodell) @robh: Thanks, this is on my radar. 
The current plan is to switch production to phab2001.codfw temporarily, then switch back from there to phab1... [09:03:36] (03PS7) 10Alexandros Kosiaris: Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani) [09:04:17] !log disable puppet on all jobrunners [09:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:34] actually let's do this correctly [09:04:39] !log disable puppet on all jobrunners T129148 [09:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:47] T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner) - https://phabricator.wikimedia.org/T129148 [09:05:28] (03CR) 10Alexandros Kosiaris: [C: 032] Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani) [09:05:46] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani) [09:08:00] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3317477 (10mmodell) [09:08:17] running puppet on tin T129148 [09:08:23] !log running puppet on tin T129148 [09:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:52] !log git pull and scap deploy --init for jobrunner T129148 [09:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:59] T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner) - https://phabricator.wikimedia.org/T129148 [09:12:58] !log running puppet on mw1161 T129148 [09:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:46] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [09:15:00] on tin files in /srv/deployment/jobrunner/jobrunner are still owned by trebuchet user :/ [09:15:03] XioNoX: ^^^ [09:15:26] hashar: yeah I think I 'll move the repo and have puppet recreate it [09:15:42] instead of messing with the ownerships manually [09:15:48] ;-D [09:15:52] volans: yeah, that's the zayo circuit we're getting emails about, zayo is working on it, fiber cut [09:16:12] yeah I was wondering why it alarmed again [09:16:13] on mw1161 at least it's owned by mwdeploy [09:16:19] or is recovering? :D [09:17:07] if the message changes it actually alarms again ;) [09:17:35] does it ? I don't remember so [09:17:43] it is not true [09:17:51] slave lag changes every time [09:18:02] and it only alarms once [09:18:13] you may be confused with passive alerts [09:18:16] maybe we have multiple notifications for that ? [09:18:18] I might remember wrong then... mumble mumble [09:18:32] right jynus only for passive, my bad [09:18:32] or special cases like CRIT -> WARN -> CRIT [09:18:39] anyway back to my migration, will look into it later if you guys haven't figured it out by then [09:19:17] RECOVERY - Check the NTP synchronisation status of timesyncd on elastic2014 is OK: OK: synced at Tue 2017-06-06 09:19:15 UTC. 
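On the question above about whether a changed check message re-alerts: for active Icinga checks, notifications follow state transitions rather than output changes, with passive checks and CRIT -> WARN -> CRIT flaps as the exceptions mentioned. A minimal Python sketch of that distinction (not Icinga's actual notification code):

```python
#!/usr/bin/env python3
"""Minimal sketch, not Icinga's real logic: a notification is sent only
when the check state changes, even if the plugin output differs between
runs (e.g. a slave lag value that changes every check)."""


def notifications(check_history):
    last_state = None
    sent = []
    for state, output in check_history:
        if state != last_state:
            sent.append('%s: %s' % (state, output))
        last_state = state
    return sent


if __name__ == '__main__':
    history = [
        ('CRITICAL', 'slave lag 120s'),
        ('CRITICAL', 'slave lag 180s'),   # output changed, state did not
        ('WARNING', 'slave lag 45s'),
        ('CRITICAL', 'slave lag 300s'),   # CRIT -> WARN -> CRIT re-alerts
    ]
    for line in notifications(history):
        print(line)
```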
[09:19:42] !log running puppet again on tin, after moving /serv/deployment/jobrunner/jobrunner T129148 [09:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:51] T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner) - https://phabricator.wikimedia.org/T129148 [09:19:54] :-} [09:20:35] still owned by trebuchet do [09:21:26] hashar: actually, it looks like this is correct [09:21:35] all repos are owned by trebuchet [09:21:38] so I made you move the repo for nothing? [09:21:54] no worries, it probably fixed some permissions anyway [09:22:18] things like adding +s on g [09:22:43] or something anyway. I am not gonna cargo cult on this one [09:22:49] sure [09:23:28] so I guess lets try on canaries hosts? They are mw1299.eqiad.wmnet mw2247.codfw.wmnet [09:23:41] no, not yet [09:23:50] I first have to run puppet over there [09:24:04] but it's good to run across all of them judging from mw1161 [09:25:33] !log moving around jobrunner/jobrunner was probably not required T129148 [09:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:42] T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner) - https://phabricator.wikimedia.org/T129148 [09:25:43] !log running puppet on videoscalers T129148 [09:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:00] !log running puppet on jobrunners T129148 [09:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:19] hashar: ok, I am running the first scap deploy -v [09:31:24] \O/ [09:31:30] !log akosiaris@tin Started deploy [jobrunner/jobrunner@161c84c]: (no justification provided) [09:31:37] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [09:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:48] canary deploy successful. Continue? [y]es/[n]o/[c]ontinue all groups: [09:31:50] :-) [09:32:22] so on mw1299 the jobrunner service got restarted [09:32:32] and jobchron is left behind. Will have to figure out a solution for that later on [09:32:48] !log akosiaris@tin Finished deploy [jobrunner/jobrunner@161c84c]: (no justification provided) (duration: 01m 17s) [09:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:55] done [09:33:27] (03CR) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation (034 comments) [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 (owner: 10Giuseppe Lavagetto) [09:33:34] akosiaris: on the whole fleet ? [09:33:45] !log restart jobchron service across jobrunners T129148 [09:33:49] hashar: yup [09:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:55] T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner) - https://phabricator.wikimedia.org/T129148 [09:34:00] hashar: 8 groups in total [09:34:14] no, 9 if we count the canary group [09:34:25] looks fine across all of them [09:34:40] apparently yes [09:35:08] (03PS3) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 [09:35:15] !log restart jobchron service across videoscalers T129148 [09:35:17] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
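The "Check systemd state" alert above reports "degraded" when one or more units on the host have failed, which is what `systemctl is-system-running` prints in that situation. A minimal Python sketch of such a check, assuming systemctl is available; not necessarily how the production NRPE plugin is implemented:

```python
#!/usr/bin/env python3
"""Minimal sketch of a 'Check systemd state' style check; not necessarily
how the production NRPE plugin is written. `systemctl is-system-running`
prints 'degraded' when one or more units have failed."""
import subprocess
import sys


def main():
    result = subprocess.run(['systemctl', 'is-system-running'],
                            capture_output=True, text=True)
    state = result.stdout.strip()
    if state == 'running':
        print('OK - running: The system is fully operational')
        return 0
    print('CRITICAL - %s: the system is operational but one or more '
          'units failed (or it is still starting/stopping)' % state)
    return 2


if __name__ == '__main__':
    sys.exit(main())
```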
[09:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:45] <_joe_> akosiaris: videoscalers are trustys btw [09:35:52] _joe_: yes I know [09:36:33] so... here's an interesting twist [09:36:54] jobrunners in codfw should not be running either jobchron.service or jobrunner.service [09:37:43] the jobrunner service on mw2153 exited 143 apparently [09:37:56] <_joe_> akosiaris: yes [09:38:12] <_joe_> akosiaris: both need to be stopped. [09:38:17] PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:38:36] yeah doing so now [09:39:17] PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:41:08] we still want to deploy the code on both eqiad and codfw hosts dont we ? [09:41:12] yes [09:41:28] should the services be masked on codfw host so? [09:41:39] !log stop jobchron/jobrunner processes across jobrunner and videoscalers in codfw [09:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:17] RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational [09:43:17] RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational [09:43:17] RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational [09:43:26] !log installing perl security updates [09:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:37] (03PS1) 1020after4: Use maniphest.edit in phab_epipe.py [puppet] - 10https://gerrit.wikimedia.org/r/357354 (https://phabricator.wikimedia.org/T159043) [09:43:54] hashar: so the services are stopped and disabled [09:43:58] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [09:43:58] but not masked ... [09:44:07] as it they can be start on a whim [09:44:24] and disabled state would prevent scap from restarting, wouldn't it? [09:44:32] no [09:44:57] disabled/enabled is orthogonal to start/stop/allowed_to_start/allowed_to_stop [09:45:09] it's only related to what happens after a boot [09:45:22] enabled => will be started on boot, disabled => nope [09:45:22] (03CR) 1020after4: "This is not particularly urgent." [puppet] - 10https://gerrit.wikimedia.org/r/357354 (https://phabricator.wikimedia.org/T159043) (owner: 1020after4) [09:47:11] hashar: and mask would not work on the videoscalers [09:47:20] they are trusty.. no systemd there, hence no mask [09:48:11] hashar: turns out we have at least 2 actionables [09:48:21] 1) scap should allow restarting multiple services [09:48:35] 2) figure out how to not restart jobrunner/jobchron in the non-active DC [09:49:14] 2) should probably not be done in scap but rather in puppet [09:49:55] _joe_: perhaps we should not be shipping jobchron and jobrunner systemd/upstart units on the non-active DC ? [09:50:11] :( [09:50:54] for multiple dc, I wonder how parsoid solves that [09:51:11] what do you mean ? [09:51:19] parsoid runs on both DCs [09:51:28] it has no state to manage [09:51:38] ahh [09:54:50] 10Operations, 10Commons, 10Wikimedia-Site-requests, 10media-storage, 10Patch-For-Review: Server side upload for Yann - https://phabricator.wikimedia.org/T166806#3317566 (10fgiunchedi) Thanks everyone for your help in debugging this, @Yann did uploading other files with e.g. v2c worked eventually? 
I see @... [09:55:34] akosiaris: I am filling a task for "scap should allow restarting multiple services" [09:55:41] hashar: ok [09:55:45] thanks! [09:58:56] hashar: ok I think we are done, aside from the actionables above. I am moving on to a different task, lemme know if you need something [10:03:27] akosiaris: want me to fill the 2) one about not restarting in non-active DC ? [10:09:44] hashar: yeah sure. thanks! [10:09:49] appreciated :-) [10:10:02] I am actually looking into the service provider [10:10:16] looks like "mask" is supported in some versions [10:10:48] but not 3.18 that we got :-( [10:10:53] 3.8* [10:12:04] akosiaris: https://phabricator.wikimedia.org/T167104 and you are on cc [10:12:05] 10Operations, 10Deployment-Systems, 10MediaWiki-JobRunner, 10scap2, and 2 others: figure out how to not restart jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3317757 (10hashar) [10:12:14] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:12:27] https://github.com/puppetlabs/puppet/commit/1e2a71604e184477f94d516d86366adf1fef2452 [10:12:33] on 4.2.0 and later [10:12:35] and we have videoscalers still on trusty [10:12:38] yes [10:12:48] so we need a different solution than mask [10:13:06] that works across distros and supports puppet 3.8 [10:14:07] and some hosts are still on puppet 3.7 ? [10:14:26] you're forgetting the actionable of upgrading those to trusty :) [10:14:36] moritzm: what's the status of that? I think you were working on it last [10:16:27] 10Operations, 10Deployment-Systems, 10MediaWiki-JobRunner, 10scap2, and 2 others: figure out how to not restart jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3317805 (10akosiaris) I 've had a quick look into the `mask` feature of systemd. That should allow us to mark a... [10:16:37] hashar: no puppet 3.7 hosts [10:17:03] paravoid: that's irrelevant. we can't really use nicely the mask feature anyway. See ^ [10:17:13] we are btw kind of abusing it in maps IIRC [10:17:29] yeah well, we should be doing that anyway [10:17:32] we allow the dev to mask the services in order to avoid having them restarted by puppet [10:17:48] but that will have to change as it seems ;) [10:17:57] it's blocked by the HHVM 3.18 memory corruption exposed by luasandbox, HHVM developers acknowledged that earlier the day, but not patch available yet [10:18:15] ah [10:18:38] we can't do it with hhvm 3.12? [10:20:11] 10Operations, 10DBA: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#1761444 (10Marostegui) I have installed `wmf-mariadb101_10.1.23-1_amd64.deb` on a fresh stretch to play around with it - will get back to you if I see issues! [10:20:50] I'd rather not, 3.12 is EOLed and has a couple of bugs fixed in 3.18, so better start with the current version from the start [10:21:14] 10Operations, 10Continuous-Integration-Config, 10Operations-Software-Development, 10Patch-For-Review: Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169#3317841 (10fgiunchedi) >>! In T144169#3317402, @Joe wrote: >> Re: naming, I think an obvious convention... [10:22:15] !log installing NSS security updates [10:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:06] but aren't they orthogonal transitions by nature? 
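As a side note to the enable/disable/mask discussion above: `systemctl is-enabled` reports the boot-time behaviour (including "masked", which also forbids manual starts), while `systemctl is-active` reports whether the unit is currently running, and the two are independent. A small Python sketch that queries both; the unit name is illustrative only:

```python
#!/usr/bin/env python3
"""Small sketch: systemd's enabled/disabled/masked state (boot behaviour,
and for 'masked' whether the unit may be started at all) is independent
from whether the unit is currently active. Unit name is illustrative."""
import subprocess


def unit_status(unit):
    # `is-enabled` -> enabled/disabled/masked/...; `is-active` -> active/inactive/failed.
    # Both commands exit non-zero for the "negative" answers, so no check=True here.
    enabled = subprocess.run(['systemctl', 'is-enabled', unit],
                             capture_output=True, text=True).stdout.strip()
    active = subprocess.run(['systemctl', 'is-active', unit],
                            capture_output=True, text=True).stdout.strip()
    return enabled, active


if __name__ == '__main__':
    # e.g. ('disabled', 'active'): disabled only means "not started at boot"
    print(unit_status('jobrunner.service'))
```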
[10:28:34] PROBLEM - OTRS SMTP on mendelevium is CRITICAL: connect to address 10.64.32.174 and port 25: Connection refused [10:28:43] (03PS1) 10Ema: check_ipmi_temp: turn off sel checking [puppet] - 10https://gerrit.wikimedia.org/r/357361 (https://phabricator.wikimedia.org/T125205) [10:28:44] PROBLEM - Check systemd state on mendelevium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:29:08] somewhat, but the 3.12-3.18 move doesn't add particular risk either [10:30:04] Amir1: Respected human, time to deploy Deploy new wb_terms configs to testwikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1030). Please do the needful. [10:30:28] akosiaris hashar can you let me know when your migration is finished? I have a scap update lined up but didn't want to do it at the same time [10:31:12] I can put pause on my deployment [10:31:33] Amir1: no worries I can wait, not urgent [10:31:36] ping me when done tho [10:32:10] kk [10:32:17] godog: we are done [10:33:08] okay, I start the deployment [10:33:56] (03CR) 10Marostegui: [C: 031] check_ipmi_temp: turn off sel checking [puppet] - 10https://gerrit.wikimedia.org/r/357361 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [10:34:11] (03PS1) 10Hashar: jobrunner: add exit codes to services units [puppet] - 10https://gerrit.wikimedia.org/r/357362 [10:34:33] (03CR) 10Ema: [V: 032 C: 032] check_ipmi_temp: turn off sel checking [puppet] - 10https://gerrit.wikimedia.org/r/357361 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [10:35:25] (03CR) 10Hashar: "The alternative is to clean up redisJobChronService / redisJobRunnerService and have them exit(0) when SIGHUP/SIGKILL/SIGTERM are caught." [puppet] - 10https://gerrit.wikimedia.org/r/357362 (owner: 10Hashar) [10:35:34] RECOVERY - OTRS SMTP on mendelevium is OK: SMTP OK - 0.150 sec. response time [10:35:44] RECOVERY - Check systemd state on mendelevium is OK: OK - running: The system is fully operational [10:36:45] marostegui: Around? https://phabricator.wikimedia.org/T165246 says it's resolved but in labs, the column is not there. [10:36:53] Amir1: checking [10:37:28] (03PS9) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [10:38:53] I can see it there on 1001 and 1003 [10:40:35] 10Operations, 10DBA: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3317984 (10jcrespo) I have to package 10.1.24 and fix some things- coming soon. 
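Change 357362 above adds exit codes to the jobrunner service units, and the review comment mentions the alternative of having redisJobRunnerService / redisJobChronService exit(0) when SIGHUP/SIGTERM are caught (exit status 143 is 128 + SIGTERM, which systemd otherwise records as a failure). A minimal Python sketch of that signal-handling pattern, purely illustrative since the real services are not Python, and noting that SIGKILL can never be trapped:

```python
#!/usr/bin/env python3
"""Purely illustrative sketch (the real jobrunner services are not Python):
trap SIGTERM/SIGHUP and exit(0) so that systemd records a clean stop
instead of exit status 143 (128 + SIGTERM). SIGKILL cannot be trapped."""
import signal
import sys
import time


def clean_exit(signum, frame):
    # Exiting 0 here makes the stop look like a normal shutdown to systemd.
    sys.exit(0)


signal.signal(signal.SIGTERM, clean_exit)
signal.signal(signal.SIGHUP, clean_exit)

while True:
    time.sleep(1)   # stand-in for the service's main loop
```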
[10:42:43] (03PS1) 10Alexandros Kosiaris: Update Templates for 5.0.20 OTRS version [software/otrs] - 10https://gerrit.wikimedia.org/r/357363 [10:43:46] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update Templates for 5.0.20 OTRS version [software/otrs] - 10https://gerrit.wikimedia.org/r/357363 (owner: 10Alexandros Kosiaris) [10:48:43] (03CR) 10Ladsgroup: [C: 032] Write in term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [10:54:14] RECOVERY - IPMI Temperature on aqs1004 is OK: Sensor Type(s) Temperature Status: OK [10:54:14] RECOVERY - IPMI Temperature on ms-be2028 is OK: Sensor Type(s) Temperature Status: OK [10:54:46] (03PS2) 10Ladsgroup: Write in term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197) [10:56:56] (03Draft2) 10Alexandros Kosiaris: Edit Project Config [software/servermon] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/357351 [10:57:06] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Edit Project Config [software/servermon] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/357351 (owner: 10Alexandros Kosiaris) [10:57:18] (03CR) 10Ladsgroup: [C: 032] Write in term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [10:57:54] RECOVERY - IPMI Temperature on labsdb1003 is OK: Sensor Type(s) Temperature Status: OK [10:58:44] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. [10:59:56] (03Merged) 10jenkins-bot: Write in term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [11:02:23] !log ladsgroup@tin Synchronized wmf-config/Wikibase-production.php: Enabling writing in full entity id in testwikidatawiki (T165197) (duration: 00m 39s) [11:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:32] T165197: Change configuration of Wikidata to write term_full_entity_id - https://phabricator.wikimedia.org/T165197 [11:02:44] (03CR) 10jenkins-bot: Write in term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [11:08:34] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [11:09:34] RECOVERY - IPMI Temperature on labsdb1011 is OK: Sensor Type(s) Temperature Status: OK [11:10:24] RECOVERY - IPMI Temperature on db2049 is OK: Sensor Type(s) Temperature Status: OK [11:11:25] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::scaler: use more sensible intervals for checks [puppet] - 10https://gerrit.wikimedia.org/r/357366 [11:11:42] <_joe_> volans: ^^ care to give a input? [11:12:29] (03PS1) 10Alexandros Kosiaris: servermon: Deploy with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/357367 (https://phabricator.wikimedia.org/T129152) [11:13:48] _joe_: sure [11:14:16] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3318112 (10Marostegui) >>! 
In T166853#3311393, @jcrespo wrote: > This one is also showing the following alarm- > > > ``` > Sensor Type(s) Temperature Status: Critical [Power Unit 2 18-VR P2 = Critical, Po... [11:15:08] 10Operations, 10Ops-Access-Requests: Request access to analytics-privatedata-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T167116#3318114 (10GoranSMilovanovic) [11:15:23] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/357366 (owner: 10Giuseppe Lavagetto) [11:15:57] (03PS1) 10Ladsgroup: Whitelist term_full_entity_id in wb_terms table [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) [11:16:24] RECOVERY - IPMI Temperature on wtp1010 is OK: Sensor Type(s) Temperature Status: OK [11:16:27] (03CR) 10jerkins-bot: [V: 04-1] Whitelist term_full_entity_id in wb_terms table [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup) [11:16:29] (03CR) 10Giuseppe Lavagetto: [C: 032] role::mediawiki::scaler: use more sensible intervals for checks [puppet] - 10https://gerrit.wikimedia.org/r/357366 (owner: 10Giuseppe Lavagetto) [11:16:58] !log uploaded ferm 2.3.2+wmf1 to apt.wikimedia.org/stretch-wikimedia (T166653) [11:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:09] T166653: ferm broken in stretch - https://phabricator.wikimedia.org/T166653 [11:19:04] PROBLEM - salt-minion processes on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:04] PROBLEM - dhclient process on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:14] PROBLEM - DPKG on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:14] PROBLEM - Check systemd state on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:24] PROBLEM - Disk space on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:26] godog: I'm done [11:19:34] PROBLEM - configured eth on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:44] PROBLEM - puppet last run on d-i-test is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:04] Amir1: thanks for the heads up! 
[11:20:32] Sorry it took so long, testing it was difficult [11:21:39] (03CR) 10Ladsgroup: "recheck" [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup) [11:21:54] RECOVERY - salt-minion processes on d-i-test is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [11:21:54] RECOVERY - dhclient process on d-i-test is OK: PROCS OK: 0 processes with command name dhclient [11:22:00] (03CR) 10jerkins-bot: [V: 04-1] Whitelist term_full_entity_id in wb_terms table [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup) [11:22:04] RECOVERY - DPKG on d-i-test is OK: All packages OK [11:22:04] RECOVERY - Check systemd state on d-i-test is OK: OK - running: The system is fully operational [11:22:15] RECOVERY - Disk space on d-i-test is OK: DISK OK [11:22:20] (03PS2) 10Alexandros Kosiaris: servermon: Deploy with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/357367 (https://phabricator.wikimedia.org/T129152) [11:22:24] RECOVERY - configured eth on d-i-test is OK: OK - interfaces up [11:22:34] RECOVERY - puppet last run on d-i-test is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:22:41] d-i-test in puppet ? [11:22:47] (03CR) 10Ladsgroup: "The error doesn't look related" [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup) [11:23:07] (03CR) 10Faidon Liambotis: [C: 04-2] "Well, that init script is pretty terrible. We need to code review this and "it was part of a package before" isn't a terribly good argumen" [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [11:25:54] (03PS3) 10Alexandros Kosiaris: servermon: Deploy with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/357367 (https://phabricator.wikimedia.org/T129152) [11:26:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "PCC happy at https://puppet-compiler.wmflabs.org/6671/netmon1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/357367 (https://phabricator.wikimedia.org/T129152) (owner: 10Alexandros Kosiaris) [11:26:34] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [11:28:06] !log akosiaris@tin Started deploy [servermon/servermon@4a2288f]: (no justification provided) [11:28:10] !log akosiaris@tin Finished deploy [servermon/servermon@4a2288f]: (no justification provided) (duration: 00m 04s) [11:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:34] RECOVERY - IPMI Temperature on cp1049 is OK: Sensor Type(s) Temperature Status: OK [11:29:58] 10Operations, 10Deployment-Systems, 10Monitoring, 10scap2, and 2 others: Deploy servermon with scap3 - https://phabricator.wikimedia.org/T129152#3318171 (10akosiaris) 05Open>03Resolved a:03akosiaris Migration completed. Servermon is now deployed using scap3. Resolving. 
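Related to the earlier !log about raising the TransportShardBulkAction log level on the cirrus cluster (T167091) and Gehel's puppet change 357371: Elasticsearch 5.x also allows changing a logger level at runtime through the cluster settings API. A hedged Python sketch against a hypothetical local node; host, port and the exact logger name are assumptions, and the production change is managed through puppet instead:

```python
#!/usr/bin/env python3
"""Sketch only: change an Elasticsearch 5.x logger level at runtime via the
cluster settings API. Host, port and logger name are assumptions; the
production change for T167091 is managed through puppet instead."""
import json

import requests


def set_logger_level(base_url, logger, level):
    body = {'transient': {'logger.' + logger: level}}
    resp = requests.put(base_url + '/_cluster/settings',
                        data=json.dumps(body),
                        headers={'Content-Type': 'application/json'})
    resp.raise_for_status()
    return resp.json()


if __name__ == '__main__':
    # assumed logger package for TransportShardBulkAction
    print(set_logger_level('http://localhost:9200',
                           'org.elasticsearch.action.bulk', 'WARN'))
```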
[11:34:45] ACKNOWLEDGEMENT - MegaRAID on ms-be2001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167118 [11:34:48] 10Operations, 10ops-codfw: Degraded RAID on ms-be2001 - https://phabricator.wikimedia.org/T167118#3318179 (10ops-monitoring-bot) [11:36:16] (03PS2) 10Hashar: Test that replica counts are within sane bounds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357345 (owner: 10DCausse) [11:36:44] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [11:39:30] (03CR) 10Hashar: [C: 032] "Cherry picked on tip of master :-} Thanks for the cleanup." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357345 (owner: 10DCausse) [11:40:33] (03Merged) 10jenkins-bot: Test that replica counts are within sane bounds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357345 (owner: 10DCausse) [11:40:42] (03CR) 10jenkins-bot: Test that replica counts are within sane bounds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357345 (owner: 10DCausse) [11:46:26] 10Operations, 10Monitoring, 10Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3318241 (10ema) [11:46:29] 10Operations, 10Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3318229 (10ema) [11:46:44] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [11:46:44] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [11:47:25] (03PS4) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 [11:47:57] (03PS1) 10Gehel: elasticsearch - raise logging of TransportShardBulkAction to WARN [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) [11:48:12] 10Operations, 10ops-codfw, 10Traffic: Degraded RAID on ms-be2001 - https://phabricator.wikimedia.org/T167118#3318249 (10Volans) [11:48:18] <_joe_> ema: ^^, but I still want to add a reactor.stop() or something to the deferreds treating ipvs [11:48:42] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2001 - https://phabricator.wikimedia.org/T167118#3318179 (10Volans) [11:48:57] (03CR) 10jerkins-bot: [V: 04-1] Add netlink-based Ipvsmanager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 (owner: 10Giuseppe Lavagetto) [11:50:42] _joe_: protocol pony :) [11:50:58] <_joe_> ema: :P [11:51:06] <_joe_> ema: uhm I did break some tests [12:02:03] 10Operations, 10ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3318296 (10jcrespo) [12:08:15] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2039774 [12:08:25] !log kill stuck osm replication on maps1001 [12:08:25] 10Operations, 10Upstream: ferm broken in stretch - https://phabricator.wikimedia.org/T166653#3318320 (10MoritzMuehlenhoff) I've uploaded ferm 2.3-2+wmf1 to stretch-wikimedia which unbreaks ferm by waiting on nss-lookup.target. This makes ferm start 1-1.5 seconds later than the default stretch unit using netwo... 
[12:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:40] tmux [12:08:51] oops, wrong windows... [12:10:55] (03CR) 10Hashar: [C: 04-1] "Your change remove Java 7 from the Jessie slaves. However we still have Maven jobs using Java 7:" [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [12:16:03] !log cp1049 - restaret varnish backend for mailbox lag [12:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:15] (03CR) 10Mforns: "This patch is to be abandoned at some point right?" [software/analytics-eventlogging-maintenance] - 10https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [12:17:34] (03CR) 10Hashar: [C: 031] "This change is still cherry picked on the CI puppet master. That is to unbreak Puppet on the permanent instances that still have HHVM." [puppet] - 10https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: 10Paladox) [12:18:15] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0 [12:18:26] 10Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3318345 (10Marostegui) 05Open>03Resolved Going to close this for now as we had no more crashes lately. [12:24:55] (03PS7) 10Paladox: contint: Only install java 7 on trusty and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) [12:26:00] (03Abandoned) 10Elukey: [WIP] Add the eventlogging_cleaner script and base package [software/analytics-eventlogging-maintenance] - 10https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [12:26:07] (03CR) 10Paladox: "Thanks, done. @Dzahn we have to install java 7 and java 8 on Jessie so have to do the if checks like this. Some of the android tests were " [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [12:28:41] (03PS1) 10Elukey: Disable role::analytics_cluster::refinery::job::guard [puppet] - 10https://gerrit.wikimedia.org/r/357372 (https://phabricator.wikimedia.org/T166937) [12:29:12] paravoid: --^ [12:29:41] 10Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3318369 (10MoritzMuehlenhoff) >>! In T158583#3310361, @faidon wrote: >> >> I think a mere component/component-staging mapping would be better; it provides more consistency and would also allow generic handlin... [12:30:34] (03CR) 10Elukey: [C: 032] Disable role::analytics_cluster::refinery::job::guard [puppet] - 10https://gerrit.wikimedia.org/r/357372 (https://phabricator.wikimedia.org/T166937) (owner: 10Elukey) [12:33:24] (03CR) 10DCausse: elasticsearch - raise logging of TransportShardBulkAction to WARN (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) (owner: 10Gehel) [12:35:00] (03CR) 10Mforns: [C: 031] "LGTM!" 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [12:35:56] (03PS1) 10Hashar: contint: remove HHVM from Trusty permanent instances [puppet] - 10https://gerrit.wikimedia.org/r/357373 [12:37:20] (03PS2) 10Gehel: elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) [12:38:32] (03CR) 10Hashar: [C: 031] "Cherry picked on CI puppet master. I have manually purged HHVM." [puppet] - 10https://gerrit.wikimedia.org/r/357373 (owner: 10Hashar) [12:39:05] elukey: <3 [12:39:13] 10Operations, 10Traffic, 10Interactive-Sprint, 10Maps (Kartographer), 10Regression: Map tiles load way slower than before - https://phabricator.wikimedia.org/T167046#3318372 (10Gehel) I can't seem to reproduce the problem from my browser. Looking at the [[ https://grafana.wikimedia.org/dashboard/db/maps-... [12:40:05] !log mobrovac@tin Started deploy [changeprop/deploy@e92dd66]: Bump src to bc8abf3 [12:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:50] !log mobrovac@tin Finished deploy [changeprop/deploy@e92dd66]: Bump src to bc8abf3 (duration: 01m 45s) [12:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:16] 10Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3318374 (10faidon) >>! In T158583#3318369, @MoritzMuehlenhoff wrote: >>>! In T158583#3310361, @faidon wrote: >>> >>> I think a mere component/component-staging mapping would be better; it provides more consis... [12:44:00] (03PS2) 10Filippo Giunchedi: Scap: Bump version to 3.5.8-1 [puppet] - 10https://gerrit.wikimedia.org/r/357239 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [12:44:27] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3318379 (10Cmjohnson) @fguinchedi he batteries for ms-be1020 and 1019 are on-site...please let me know when you want to swap them [12:45:12] 10Operations, 10ops-eqiad, 10DBA: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3318380 (10Cmjohnson) @Marostegui The battery is here...let me know when you want to replace [12:45:25] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [12:47:15] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [12:47:21] cmjohnson1: awesome, today in 20" works for you? re: hp battery [12:47:39] godog..sure [12:48:25] cmjohnson1: nice, I'll ping you in 20" ! [12:48:31] (03CR) 10Filippo Giunchedi: [C: 032] Scap: Bump version to 3.5.8-1 [puppet] - 10https://gerrit.wikimedia.org/r/357239 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [12:48:55] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [12:49:45] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [12:49:59] 10Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3318384 (10MoritzMuehlenhoff) Ok, got it. I think there are valid use cases for both, for a temporary migration (e.g. towards a new HHVM LTS) it seems more useful to use -staging, while for more generational c... 
[12:50:05] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 39 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:51:25] !log upgrade scap to 3.5.8 - T127762 [12:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:33] T127762: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762 [12:52:19] hashar: nothing for eu swat so far [12:53:14] 10Operations, 10ops-eqiad, 10DBA: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3318391 (10Marostegui) @Cmjohnson I will depool the server now and ping you once it is down. [12:54:09] (03PS1) 10Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357375 (https://phabricator.wikimedia.org/T166518) [12:54:15] jouncebot: refresh [12:54:17] I refreshed my knowledge about deployments. [12:54:21] jouncebot: next [12:54:21] In 0 hour(s) and 5 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1300) [12:54:30] zeljkof: nice :-} [12:55:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357375 (https://phabricator.wikimedia.org/T166518) (owner: 10Marostegui) [12:56:51] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357375 (https://phabricator.wikimedia.org/T166518) (owner: 10Marostegui) [12:57:00] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357375 (https://phabricator.wikimedia.org/T166518) (owner: 10Marostegui) [12:58:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1094 for maintenance - T166518 (duration: 00m 39s) [12:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:13] T166518: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518 [12:58:23] !log Shutdown db1094 for maintenance - T166518 [12:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1300). Please do the needful. [13:01:58] (03CR) 10Alexandros Kosiaris: Refactor facts exporting to better cleanup facts (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356814 (owner: 10Alexandros Kosiaris) [13:03:18] (03PS4) 10Alexandros Kosiaris: Refactor facts exporting to better cleanup facts [puppet] - 10https://gerrit.wikimedia.org/r/356814 [13:05:17] (03CR) 10DCausse: [C: 031] elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) (owner: 10Gehel) [13:07:47] (03CR) 10Filippo Giunchedi: "LGTM, for now it would work as-is. 
Though since runtime can be significant I think an improvement would be to accept an optional list of f" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: 10Volans) [13:08:32] (03PS1) 10Filippo Giunchedi: install_server: ms-be2013 / 16 / 17 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/357377 (https://phabricator.wikimedia.org/T162609) [13:08:34] (03PS1) 10Filippo Giunchedi: hieradata: fix missing yaml extension [puppet] - 10https://gerrit.wikimedia.org/r/357378 [13:11:18] cmjohnson1: ok to start from ms-be1019 ? I'll power down [13:13:53] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2001 - https://phabricator.wikimedia.org/T167118#3318430 (10fgiunchedi) @Papaul this host is scheduled for decom and has otherwise no production data, don't bother replacing the disk [13:14:59] godog hold on...I only received 1 bbu for you not the 2 [13:15:17] let me check to make sure we're replacing the correct server in case HP needs a log [13:15:30] cmjohnson1: ok [13:16:35] RECOVERY - HP RAID on db1094 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: OK [13:16:41] godog: let's do ms-be1020 please [13:17:03] cmjohnson1: sure, I'll downtime and power off [13:19:12] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/357378 (owner: 10Filippo Giunchedi) [13:21:22] (03PS3) 10Gehel: elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) [13:21:25] cmjohnson1: should be off now / shortly [13:21:53] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3318469 (10Marostegui) 05Open>03Resolved a:03Cmjohnson All good now - thanks Chris! ``` Cache Backup Power Source: Batteries Battery/Capacitor... [13:24:30] godog: powering on [13:26:44] (03PS1) 10Marostegui: db-eqiad.php: Repool db1094 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357379 [13:26:52] 10Operations, 10Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3318478 (10ema) [13:28:27] (03PS2) 10Filippo Giunchedi: install_server: ms-be2013 / 16 / 17 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/357377 (https://phabricator.wikimedia.org/T162609) [13:29:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1094 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357379 (owner: 10Marostegui) [13:29:58] cmjohnson1: yep thanks it 1020 is back, I'll update the task [13:30:15] 10Operations, 10Deployment-Systems, 10MediaWiki-JobRunner, 10Release-Engineering-Team (Kanban), 10Scap (Scap3-Adoption-Phase1): figure out how to not restart jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3318487 (10thcipriani) [13:31:25] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3318492 (10fgiunchedi) ms-be1020 had its bbu swapped, error cleared: ``` # /usr/local/lib/nagios/plugins/check_hpssacli OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I... 
[13:32:16] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1094 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357379 (owner: 10Marostegui) [13:32:24] (03CR) 10Filippo Giunchedi: [C: 032] install_server: ms-be2013 / 16 / 17 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/357377 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [13:32:28] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1094 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357379 (owner: 10Marostegui) [13:33:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1094 with low weight (duration: 00m 40s) [13:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:47] 10Operations, 10Monitoring, 10Services (next), 10User-Joe, 10User-mobrovac: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3318499 (10mobrovac) [13:34:06] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318500 (10Cmjohnson) @elukey new raid controllers for an1033 and 1039 are on-site. please let me know when you want to swap them out [13:34:43] cmjohnson1: whenever you want! [13:34:52] I'd just need half an hour to drain the hosts [13:34:56] let's do this now so I can get back to the new servers [13:35:04] okay...ping me once they're ready [13:35:32] cmjohnson1: sure, draining them now [13:36:12] 10Operations, 10ops-eqiad, 10User-Joe: Decom mw1170-mw1179, and replace them with new systems. - https://phabricator.wikimedia.org/T167130#3318501 (10Cmjohnson) [13:37:09] 10Operations, 10ops-eqiad, 10User-Joe: Decom mw1170-mw1179, and replace them with new systems. - https://phabricator.wikimedia.org/T167130#3318513 (10Cmjohnson) [13:37:11] 10Operations, 10ops-eqiad, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3318514 (10Cmjohnson) [13:37:15] RECOVERY - HP RAID on ms-be1020 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [13:38:23] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3301040 (10MoritzMuehlenhoff) I've added Keith to pwstore and he confirmed that it's working fine. [13:38:35] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3318523 (10MoritzMuehlenhoff) [13:38:56] (03PS1) 10Marostegui: db-eqiad.php: Increase db1094 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357381 [13:39:32] !log shutdown analytics1033 and analytics1039 to replace their BBU - T166140 [13:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:43] T166140: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140 [13:42:31] 10Operations, 10OTRS: Upgrade OTRS to 5.0.20 - https://phabricator.wikimedia.org/T167131#3318529 (10akosiaris) [13:42:41] 10Operations, 10OTRS: Upgrade OTRS to 5.0.20 - https://phabricator.wikimedia.org/T167131#3318544 (10akosiaris) 05Open>03Resolved [13:43:43] 10Operations, 10OTRS, 10Patch-For-Review: Upgrade OTRS to 5.0.19 - https://phabricator.wikimedia.org/T165284#3261913 (10akosiaris) Almost a week has passed, I 'll resolve this one. Feel free to reopen. 
Note that per T167131 we have already upgraded to 5.0.20 [13:43:50] 10Operations, 10OTRS, 10Patch-For-Review: Upgrade OTRS to 5.0.19 - https://phabricator.wikimedia.org/T165284#3318550 (10akosiaris) 05Open>03Resolved [13:44:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1094 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357381 (owner: 10Marostegui) [13:45:00] 10Operations, 10OTRS, 10Upstream: Investigate OTRS 5.0.6 memory leak - https://phabricator.wikimedia.org/T126448#3318552 (10akosiaris) 05Open>03declined I am gonna resolve this as Declined. Upstream did not verify this bug's existence and we have mitigations in place anyway. [13:45:23] (03PS4) 10Tjones: Enable BM25 for Chinese wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) [13:45:26] cmjohnson1: the hosts should begin to shutdown in a minute [13:45:33] (analytics1033 and 1039) [13:45:48] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1094 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357381 (owner: 10Marostegui) [13:45:57] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1094 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357381 (owner: 10Marostegui) [13:46:03] 10Operations, 10DBA, 10Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3318555 (10jcrespo) 05Open>03stalled [13:46:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1094 weight (duration: 00m 40s) [13:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:26] (03PS2) 10Andrew Bogott: novastats: Add 'diskspace.py' script [puppet] - 10https://gerrit.wikimedia.org/r/357014 (https://phabricator.wikimedia.org/T163796) [13:50:37] (03CR) 10Andrew Bogott: [C: 032] novastats: Add 'diskspace.py' script [puppet] - 10https://gerrit.wikimedia.org/r/357014 (https://phabricator.wikimedia.org/T163796) (owner: 10Andrew Bogott) [13:51:25] (03PS2) 10Filippo Giunchedi: hieradata: fix missing yaml extension [puppet] - 10https://gerrit.wikimedia.org/r/357378 [13:52:53] (03PS1) 10Marostegui: db-eqiad.php: Restore db1094 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357383 [13:52:55] (03CR) 10Muehlenhoff: [C: 031] hieradata: fix missing yaml extension [puppet] - 10https://gerrit.wikimedia.org/r/357378 (owner: 10Filippo Giunchedi) [13:53:11] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: fix missing yaml extension [puppet] - 10https://gerrit.wikimedia.org/r/357378 (owner: 10Filippo Giunchedi) [13:54:49] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3318627 (10faidon) >>! In T166888#3316057, @greg wrote: > Looking at the data we have it seems that the tests themselves take about [[ https://integration.wikimedia.org... [13:57:10] ugh looks like eqiad / smokeping can't talk at all to cr1-eqdfw ? 
https://smokeping.wikimedia.org/smokeping.cgi?target=codfw.Core.cr1-eqdfw [13:57:30] XioNoX ^ [13:59:30] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1094 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357383 (owner: 10Marostegui) [14:00:33] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1094 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357383 (owner: 10Marostegui) [14:00:42] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1094 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357383 (owner: 10Marostegui) [14:01:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1094 original weight (duration: 00m 40s) [14:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:10] godog, I can reach that router, investigating, is that causing any outage? [14:02:39] elukey: 1033 is powering up [14:03:03] XioNoX: no impact afaict no, I was surprised though that smokeping in eqiad stopped being able to talk to it [14:03:32] (03PS2) 10Mobrovac: Set the User-Agent header field when doing requests; v0.1.2 [software/service-checker] - 10https://gerrit.wikimedia.org/r/356870 [14:05:12] 10Operations, 10Monitoring, 10Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3318684 (10ema) [14:05:15] 10Operations, 10Monitoring: labvirt1008/labsdb1001: FreeIPMI returned an empty header map - https://phabricator.wikimedia.org/T167138#3318672 (10ema) [14:07:07] cmjohnson1: ack [14:15:30] an1033 looks good [14:16:53] (03PS1) 10Andrew Bogott: diskspace.py: Catch stray instances that nova and filesystem disagree about [puppet] - 10https://gerrit.wikimedia.org/r/357388 [14:19:22] (03CR) 10jerkins-bot: [V: 04-1] diskspace.py: Catch stray instances that nova and filesystem disagree about [puppet] - 10https://gerrit.wikimedia.org/r/357388 (owner: 10Andrew Bogott) [14:20:12] (03PS2) 10Andrew Bogott: diskspace.py: Catch stray instances that nova and filesystem disagree about [puppet] - 10https://gerrit.wikimedia.org/r/357388 [14:20:32] elukey: 1039 is powering on [14:21:18] godog: yeah, it's weird, mtr works, but not pings [14:23:59] (03CR) 10Giuseppe Lavagetto: [C: 032] Set the User-Agent header field when doing requests; v0.1.2 [software/service-checker] - 10https://gerrit.wikimedia.org/r/356870 (owner: 10Mobrovac) [14:24:04] XioNoX: I'm seeing at least a couple of Icinga alarms flapping regarding 208.80.153.198 [14:25:39] cmjohnson1: thanks! [14:25:40] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2046271 [14:26:10] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [14:26:14] 10Operations, 10Monitoring, 10Services (next), 10User-Joe, 10User-mobrovac: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3315673 (10Joe) I would start monitoring restbase on text-lb and maps on text-upload. [14:27:00] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [14:27:19] XioNoX: are we losing eqdfw? 
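A rough sketch of the comparison being described here ("mtr works, but not pings"), using the v4 address the Icinga checks above report for cr1-eqdfw; nothing in it is specific to the actual debugging session:
```
TARGET=208.80.153.198              # cr1-eqdfw, per the SNMP/BGP alerts above
ping -c 5 "$TARGET"                # plain ICMP echo requests
mtr -n -c 10 --report "$TARGET"    # per-hop probes; loss that shows up only at the final hop
                                   # usually means the router filters or rate-limits ICMP,
                                   # not that the path itself is down
# Running the same two commands from a codfw host (e.g. via ssh or cumin) shows whether
# the problem is specific to the eqiad side, as suggested above.
```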
[14:27:55] paravoid: some conenctivity issues from at least eqiad, still investigating [14:28:00] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [14:28:05] there is no eqiad->eqdfw [14:28:07] just codfw->eqdfw [14:28:26] XioNoX: the active icinga is tegmen now and it is in codfw [14:29:44] (03PS10) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [14:29:49] (03PS7) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [14:29:50] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [14:32:44] (03Abandoned) 10Hashar: contint: ElasticSearch role for build logs [puppet] - 10https://gerrit.wikimedia.org/r/322488 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar) [14:32:54] paravoid: like icmp can't go through but mtr works fine [14:33:14] except directly from outside [14:33:56] (03PS3) 10Andrew Bogott: diskspace.py: Catch stray instances that nova and filesystem disagree about [puppet] - 10https://gerrit.wikimedia.org/r/357388 [14:36:34] (03CR) 10Andrew Bogott: [C: 032] diskspace.py: Catch stray instances that nova and filesystem disagree about [puppet] - 10https://gerrit.wikimedia.org/r/357388 (owner: 10Andrew Bogott) [14:42:48] cmjohnson1: 1039 is good, thanks a lot! [14:42:58] great...the other okay? [14:43:04] yep yep all good [14:43:10] the BBU shows up as optimal now [14:44:00] (03Abandoned) 10Hashar: contint: update unattended-upgrade setting [puppet] - 10https://gerrit.wikimedia.org/r/315079 (owner: 10Hashar) [14:44:02] (03Abandoned) 10Hashar: contint: unattended upgrade from distro [puppet] - 10https://gerrit.wikimedia.org/r/315084 (owner: 10Hashar) [14:45:24] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318884 (10Cmjohnson) 05Open>03Resolved Replaced both bbu's Return shipping info Fedex 9612018 6911799 02034386 96112018 6911799 02034379 [14:45:47] (03PS3) 10Hashar: zuul: rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/299151 [14:47:49] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318897 (10elukey) 05Resolved>03Open [14:47:57] (03CR) 10Hashar: [V: 031 C: 031] "I have used that patch when refactoring the zuul class to use hiera. I typically use this to assert the zuul manifests somehow compile." 
[puppet] - 10https://gerrit.wikimedia.org/r/299151 (owner: 10Hashar) [14:49:00] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [14:49:12] (03PS3) 10Hashar: nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 [14:49:50] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [14:49:55] (03CR) 10Hashar: [V: 031 C: 031] "That covers an issue we had earlier when generating the icinga contacts (see fix https://gerrit.wikimedia.org/r/#/c/331459/ )" [puppet] - 10https://gerrit.wikimedia.org/r/331490 (owner: 10Hashar) [14:50:13] (03CR) 10Hashar: [V: 031 C: 031] nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 (owner: 10Hashar) [14:50:27] (03PS1) 10Filippo Giunchedi: swift: create swift user home [puppet] - 10https://gerrit.wikimedia.org/r/357396 (https://phabricator.wikimedia.org/T162609) [14:53:00] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [14:53:50] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [14:56:09] (03PS2) 10Filippo Giunchedi: swift: create swift user home [puppet] - 10https://gerrit.wikimedia.org/r/357396 (https://phabricator.wikimedia.org/T162609) [14:56:43] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318969 (10elukey) [14:57:54] (03CR) 10Filippo Giunchedi: [C: 031] Refactor facts exporting to better cleanup facts [puppet] - 10https://gerrit.wikimedia.org/r/356814 (owner: 10Alexandros Kosiaris) [14:58:25] !log installing libsndfile security updates on trusty [14:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:36] (03PS8) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [15:00:05] (03PS3) 10Filippo Giunchedi: swift: create swift user home [puppet] - 10https://gerrit.wikimedia.org/r/357396 (https://phabricator.wikimedia.org/T162609) [15:02:50] (03CR) 10Filippo Giunchedi: [C: 032] swift: create swift user home [puppet] - 10https://gerrit.wikimedia.org/r/357396 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [15:02:50] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [15:03:40] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [15:06:10] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [15:07:00] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [15:07:37] 10Operations, 10ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3319033 (10Papaul) p:05Triage>03Normal [15:07:50] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [15:09:40] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [15:12:00] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [15:13:50] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [15:15:44] 
(03CR) 10Paladox: [C: 031] "Looks correct in how it should be translated to json. Though untested." [puppet] - 10https://gerrit.wikimedia.org/r/357354 (https://phabricator.wikimedia.org/T159043) (owner: 1020after4) [15:16:21] (03CR) 10Madhuvishy: [C: 032] labstore: avoid the hardcoding of eth0/eth1 [puppet] - 10https://gerrit.wikimedia.org/r/356109 (owner: 10Faidon Liambotis) [15:16:56] (03CR) 10Madhuvishy: [C: 032] labstore: use the interface_primary fact, not eth0 [puppet] - 10https://gerrit.wikimedia.org/r/356108 (owner: 10Faidon Liambotis) [15:17:24] madhuvishy: I'd do https://gerrit.wikimedia.org/r/#/c/356107/ first, then PCC the rest [15:17:40] but ymmv :) [15:18:15] paravoid: ah yes that's why it wouldn't let me rebase, yup okay [15:18:21] (03CR) 10Faidon Liambotis: [C: 04-1] "Find a way to test this?" [puppet] - 10https://gerrit.wikimedia.org/r/357354 (https://phabricator.wikimedia.org/T159043) (owner: 1020after4) [15:21:51] !log otto@tin Started deploy [eventlogging/analytics@37233cd]: (no justification provided) [15:21:56] !log otto@tin Finished deploy [eventlogging/analytics@37233cd]: (no justification provided) (duration: 00m 04s) [15:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:10] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:10] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:10] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:11] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:11] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:11] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:11] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:12] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:12] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:14] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:23:14] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (bad PMCID) timed out before a response was received [15:24:01] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [15:24:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:24:03] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are 
healthy [15:24:10] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [15:24:10] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:26:50] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [15:27:59] (03CR) 10Madhuvishy: "I get the reasoning for `which tc`, but I think we should err on the side of fully qualifying the path with TC=/sbin/tc. This would be san" [puppet] - 10https://gerrit.wikimedia.org/r/356107 (owner: 10Faidon Liambotis) [15:28:40] (03PS4) 10Ottomata: Add exception for events tagged as coming from MW [puppet] - 10https://gerrit.wikimedia.org/r/356626 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [15:28:50] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [15:29:10] madhuvishy: it will be just noise [15:29:44] all of these fully-qualified paths are just misconceptions and people carrying over old unix practices to modern systems [15:29:47] (03CR) 10jerkins-bot: [V: 04-1] Add exception for events tagged as coming from MW [puppet] - 10https://gerrit.wikimedia.org/r/356626 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [15:30:02] (03PS1) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [15:30:18] (03PS4) 10Gehel: elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) [15:32:01] paravoid: sorry, in team meeting, will respond in a bit [15:32:08] k, sorry [15:32:19] (03PS5) 10Ottomata: Add exception for events tagged as coming from MW [puppet] - 10https://gerrit.wikimedia.org/r/356626 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [15:32:41] I really don't care that much though :) [15:33:55] (03PS2) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [15:34:18] 10Operations, 10ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3319132 (10Papaul) a:05Papaul>03jcrespo firmware upgrade complete [15:34:20] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:35:26] (03PS3) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [15:38:29] I hate how this is a hiera variable :( [15:39:29] XioNoX: what's the status of eqdfw then? 
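The `which tc` exchange above boils down to two styles; a tiny illustration of both (the TC variable name comes from the review comment, and the qdisc call is only an example invocation):
```
# Option A (roughly what the change under review does): resolve tc from the caller's PATH
TC="$(which tc)" || { echo 'tc not found in PATH' >&2; exit 1; }
# Option B (the reviewer's suggestion): pin the path so a user's PATH cannot change behaviour
# TC=/sbin/tc
"$TC" qdisc show   # example invocation only
```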
[15:40:51] paravoid: still investigating, IPv6 goes through fine, but v4 doesn't in some cases [15:42:05] Ipv6 is a b!** [15:42:11] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [15:42:19] (03CR) 10Ottomata: [C: 032] Add exception for events tagged as coming from MW [puppet] - 10https://gerrit.wikimedia.org/r/356626 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [15:43:01] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [15:43:20] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational [15:44:00] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [15:44:50] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [15:45:40] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 2010 [15:51:44] (03PS5) 10Gehel: elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) [15:53:51] (03CR) 10Gehel: [C: 032] elasticsearch - raise logging of actions to INFO [puppet] - 10https://gerrit.wikimedia.org/r/357371 (https://phabricator.wikimedia.org/T167091) (owner: 10Gehel) [15:58:34] (03PS1) 10Giuseppe Lavagetto: role::graphite::alerts: add transformNull to some alerts [puppet] - 10https://gerrit.wikimedia.org/r/357409 [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1600). [16:00:52] (03PS2) 10Giuseppe Lavagetto: role::graphite::alerts: add transformNull to some alerts [puppet] - 10https://gerrit.wikimedia.org/r/357409 [16:02:00] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [16:02:04] <_joe_> oh jenkins you'll get me old [16:02:11] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::graphite::alerts: add transformNull to some alerts [puppet] - 10https://gerrit.wikimedia.org/r/357409 (owner: 10Giuseppe Lavagetto) [16:02:32] _joe_: got to love ci [16:02:53] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [16:06:13] (03PS4) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [16:06:50] PROBLEM - puppet last run on graphite1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:07:55] (03PS5) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [16:11:00] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [16:11:50] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [16:13:24] mmmm noop for pcc, weird [16:15:16] argh I got the wrong role [16:15:52] (03CR) 10Hashar: [C: 031] contint: Only install java 7 on trusty and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [16:16:20] ah nice we have the same info duplicated [16:17:08] ottomata: do we need hieradata/role/common/analytics/hadoop/worker.yaml since we have hieradata/role/common/analytics_cluster/hadoop/worker.yaml ?? [16:18:09] paravoid: following up, this script is run both by puppet, but also intended to be run by users operationally - in that case, user PATH could affect the script running [16:18:50] I'm in meeting now :) [16:18:59] but why would users run this operationally? [16:18:59] mostly just seems cleaner to me, and in coherence with every other script to fully qualify paths. if we have a standard for this sorta thing, i'm happy to follow it :) [16:19:00] okay :) [16:20:24] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3319296 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [16:20:32] (03PS6) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [16:22:16] (03PS7) 10Elukey: Set profile::base::check_raid_policy to 'WriteBack' for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) [16:22:44] RECOVERY - puppet last run on graphite1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:24:24] ok this should work [16:24:34] need to remove the stale hieradata [16:24:51] (03PS1) 10DCausse: [cirrus] Enable crossproject search on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) [16:26:55] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/6676/" [puppet] - 10https://gerrit.wikimedia.org/r/357403 (https://phabricator.wikimedia.org/T166140) (owner: 10Elukey) [16:27:48] (03CR) 10Chad: "Yeah, it's a pretty terrible script. It comes directly from upstream, we've never actually done any changes to it ourselves." 
[puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [16:28:35] (03CR) 10EBernhardson: [C: 031] [cirrus] Enable crossproject search on all wikipedias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) (owner: 10DCausse) [16:29:10] (03PS2) 10DCausse: [cirrus] Enable crossproject search on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) [16:31:21] (03PS3) 10DCausse: [cirrus] Enable crossproject search on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) [16:35:44] !log rebooted lvs1007 (kernel update) [16:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:44] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:39:06] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3319367 (10greg) Look, we get it, CI is slower than people would like. When we proposed the nodepool backend we were optimizing for clean environment and maintainabilit... [16:39:44] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [16:39:52] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3319369 (10elukey) Current status: ``` elukey@neodymium:~$ sudo cumin 'R:class = role::analytics_cluster::hadoop::worker' 'megacl... [16:40:10] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3319372 (10greg) We're still open to helping get ops/puppet in a better place than it is now with small wins until we can migrate to the new docker based system, if you... [16:41:24] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3319375 (10jcrespo) `Rebuilding`, will resolve once it is done. [16:41:28] !log rebooted lvs1007 (kernel update) [16:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:59] 10Operations, 10ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3319376 (10jcrespo) Papaul, you are the best! [16:43:04] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:43:21] (03PS1) 10Muehlenhoff: Extend account expiry date for pnorman [puppet] - 10https://gerrit.wikimedia.org/r/357417 [16:45:14] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [16:46:39] (03CR) 10Muehlenhoff: [C: 032] Extend account expiry date for pnorman [puppet] - 10https://gerrit.wikimedia.org/r/357417 (owner: 10Muehlenhoff) [16:46:58] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3319383 (10Papaul) a:05Papaul>03Gehel @ Gehel The new SSD is in place [16:49:23] (03PS1) 10Elukey: Delete unused role/common/analytics/hadoop configs [puppet] - 10https://gerrit.wikimedia.org/r/357418 [16:49:44] RECOVERY - HP RAID on elastic2020 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2 - Controller: OK - Battery/Capacitor: OK [16:50:47] gehel: is that true? ^^^^ I cannot believe it :-P [16:51:31] volans: beleive it or not, but elastic2020 might be back in the cluster tomorrow! 
(emphasis on *might*) [16:51:40] lol :D [16:53:17] (03CR) 10Tjones: [C: 031] [cirrus] Enable crossproject search on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357413 (https://phabricator.wikimedia.org/T162276) (owner: 10DCausse) [16:53:44] !log installing wireshark security updates on trusty (jessie already fixed) [16:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:40] 10Operations, 10Commons, 10Wikimedia-Site-requests, 10media-storage, 10Patch-For-Review: Server side upload for Yann - https://phabricator.wikimedia.org/T166806#3319419 (10Yann) Other small files uploaded OK. Thanks to @Dereckson for processing this. [16:58:27] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3319424 (10RobH) [16:58:57] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3219.30 Read Requests/Sec=2155.30 Write Requests/Sec=18.10 KBytes Read/Sec=35650.80 KBytes_Written/Sec=92.00 [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1700). [17:00:16] Nothing for ORES today [17:02:08] 10Operations, 10ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3319450 (10Papaul) @jcrespo thanks [17:04:42] paravoid: mostly for testing, when a rule is changed etc - they are just one offs - i'm not married to the idea of using fully qualified paths, but made sense to be consistent. [17:05:37] PROBLEM - swift-account-auditor on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:37] PROBLEM - swift-container-server on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:37] PROBLEM - swift-object-auditor on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:37] PROBLEM - swift-account-server on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:37] PROBLEM - very high load average likely xfs on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:47] PROBLEM - MD RAID on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:47] PROBLEM - swift-object-replicator on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:47] PROBLEM - Disk space on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:47] PROBLEM - swift-container-auditor on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:48] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:48] PROBLEM - swift-container-updater on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:57] PROBLEM - swift-account-replicator on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:58] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=20.10 Read Requests/Sec=7.50 Write Requests/Sec=11.80 KBytes Read/Sec=30.40 KBytes_Written/Sec=77.20 [17:05:58] PROBLEM - configured eth on ms-be2016 is CRITICAL: Return code of 255 is out of bounds [17:05:59] that's me, downtime expired [17:06:07] fixed [17:06:57] RECOVERY - configured eth on ms-be2016 is OK: OK - interfaces up [17:07:37] RECOVERY - swift-account-auditor on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:07:37] RECOVERY - swift-container-server on 
ms-be2016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:07:37] RECOVERY - very high load average likely xfs on ms-be2016 is OK: OK - load average: 31.04, 29.33, 19.21 [17:07:37] RECOVERY - swift-object-auditor on ms-be2016 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:07:37] RECOVERY - swift-account-server on ms-be2016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [17:07:47] RECOVERY - MD RAID on ms-be2016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [17:07:47] RECOVERY - swift-object-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:07:47] RECOVERY - Disk space on ms-be2016 is OK: DISK OK [17:07:48] RECOVERY - swift-container-auditor on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:07:48] RECOVERY - swift-container-updater on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:07:57] RECOVERY - swift-account-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:09:10] (03CR) 10Elukey: "10 NO-OPs: https://puppet-compiler.wmflabs.org/6678/" [puppet] - 10https://gerrit.wikimedia.org/r/357418 (owner: 10Elukey) [17:11:47] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational [17:15:55] (03PS1) 10Filippo Giunchedi: aptrepo: add hp-mcp-stretch [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) [17:16:43] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3319490 (10RobH) [17:16:49] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2001 - https://phabricator.wikimedia.org/T167160#3319508 (10RobH) [17:19:30] mobrovac: _joe_ jrbranaa blubber workboard created, moved 2 tasks that looked related, mess up as you all see fit: https://phabricator.wikimedia.org/project/view/2812/ https://phabricator.wikimedia.org/source/blubber/ [17:25:07] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 14 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [17:25:59] (03PS1) 10Chad: Group0 to wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357424 [17:26:11] (03CR) 10Chad: [C: 04-2] "For later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357424 (owner: 10Chad) [17:29:01] (03PS2) 10Filippo Giunchedi: aptrepo: add hp-mcp-stretch and thirdparty/hwraid [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) [17:29:03] (03PS1) 10Jdlrobson: Disable page previews on wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357425 (https://phabricator.wikimedia.org/T166894) [17:30:18] tgr: HEY.. i cant login to wikitech anymore - phone broken and no access to google authenticator [17:30:28] can you reset me again? 
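Back on the RAID-policy thread (the analytics BBU swaps and the profile::base::check_raid_policy change above): the truncated cumin one-liner is checking roughly the following on each Hadoop worker. A sketch only; the stock LSI MegaCLI binary name and flags are used here and may differ from the wrapper the Icinga check runs.
```
# Current vs default write policy on every logical drive (WriteBack is expected once the BBU is healthy)
sudo megacli -LDInfo -LAll -aAll | grep -i 'cache policy'
# Battery/BBU state; a failed or charging BBU is what forces controllers back to WriteThrough
sudo megacli -AdpBbuCmd -GetBbuStatus -aAll | grep -iE 'battery|charg|state'
```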
[17:31:06] !log demon@tin Started scap: testwiki to wmf.3, prepping l10n cache [17:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:17] RECOVERY - configured eth on labtestvirt2003 is OK: OK - interfaces up [17:32:40] ^ papaul result of your fix thank you [17:33:36] chasemp: no problem [17:33:51] (03PS1) 10Jdlrobson: Update ContentNamespaces for Commons Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357426 (https://phabricator.wikimedia.org/T167077) [17:48:41] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:48:46] 10Operations, 10Ops-Access-Requests: Request access to analytics-privatedata-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T167116#3318114 (10RobH) I can confirm that @GoranSMilovanovic has signed an NDA with WMF Legal (I checked against the 2016/17 NDA housekeeping: Volunteer accounts with S... [17:53:53] ^^ elastic2020 downtime seems to have expired, I'm adding some downtime, waiting for the reimage... [17:55:05] (03PS1) 10Andrew Bogott: diskspace.py: Add one more special-case flavor size. [puppet] - 10https://gerrit.wikimedia.org/r/357431 [17:56:05] (03PS2) 10Andrew Bogott: diskspace.py: Add one more special-case flavor size. [puppet] - 10https://gerrit.wikimedia.org/r/357431 [18:00:04] MaxSem and Niharika: Dear anthropoid, the time has come. Please deploy Deploy LoginNotify (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1800). [18:01:15] MaxSem: Just about done with my thing [18:03:04] !log demon@tin Finished scap: testwiki to wmf.3, prepping l10n cache (duration: 31m 58s) [18:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:49] MaxSem: All yours [18:05:00] danke, RainbowSprinkles [18:05:33] (03PS2) 10MaxSem: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357317 (https://phabricator.wikimedia.org/T165007) [18:05:45] (03CR) 10MaxSem: [C: 032] Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357317 (https://phabricator.wikimedia.org/T165007) (owner: 10MaxSem) [18:06:36] RainbowSprinkles, should I revert the livehack (testwiki to wmf.3)? [18:06:37] I'm around to test, MaxSem. [18:06:51] (03Merged) 10jenkins-bot: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357317 (https://phabricator.wikimedia.org/T165007) (owner: 10MaxSem) [18:07:05] MaxSem: Rather you not. Feel free to put a local commit there for it [18:08:11] it doesn't mind if I pull, so just leaving it there [18:10:45] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/357317/2 (duration: 00m 44s) [18:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:31] Niharika, pulled on mwdebug1002 [18:11:40] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [18:11:51] Checking. [18:12:46] MaxSem: Looks good to me. 
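The LoginNotify SWAT above follows the usual config pattern: verify the merged change on a debug host, then sync it everywhere. A hedged sketch (the host name is from the log; the scap subcommands and the commit message are assumptions based on how the !log lines read):
```
# On the debug host: pull the merged config so it can be tested via X-Wikimedia-Debug
ssh mwdebug1002.eqiad.wmnet 'scap pull'
# ...manual verification happens here...
# On the deployment host (tin): push the single changed file to the whole cluster
scap sync-file wmf-config/InitialiseSettings.php 'Enable LoginNotify on testwiki - T165007'
```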
[18:13:47] (03CR) 10jenkins-bot: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357317 (https://phabricator.wikimedia.org/T165007) (owner: 10MaxSem) [18:15:32] !log maxsem@tin Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/357317/2 (duration: 00m 44s) [18:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:20] I think downtimes got lost again [18:16:51] !log maxsem@tin Started scap: LoginNotify to testwiki - rebuild messages [18:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:54] 10Operations, 10Icinga, 10Monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3319932 (10jcrespo) I think this happened again. It didn't page because now I disable alerts every time I reimage a host, but page spam will c... [18:20:30] (03Abandoned) 10Chad: Group0 to wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357424 (owner: 10Chad) [18:28:47] MaxSem: Done soon? [18:29:08] 18:24:11 Updating LocalisationCache for 1.30.0-wmf.3 using 10 thread(s) [18:29:15] Gdi. [18:29:26] I fucked up. Should've done wmf.4 anyway instead of wmf.3 [18:29:38] :O [18:29:44] So now you're rebuilding a useless l10n cache [18:30:14] it's probably nearly done [18:34:34] what wmf are we on are we back on schedule or we still behind? [18:36:13] wmf.4 will go out this week [18:36:21] explained here: https://phabricator.wikimedia.org/T165957#3309601 [18:36:25] Also: https://tools.wmflabs.org/versions/ [18:36:35] (will always tell you what version is deployed where) [18:38:26] RainbowSprinkles: that wont load properly for me today... its my end i know so ya [18:40:34] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3320042 (10jcrespo) 05Open>03Resolved [18:50:27] * RainbowSprinkles twiddles thumbs [18:55:11] !log maxsem@tin Finished scap: LoginNotify to testwiki - rebuild messages (duration: 38m 19s) [18:55:19] Niharika, ^ [18:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:24] 38m is way too slow [18:55:31] Curious since I had just scapped prior [18:55:43] Nice! Thanks MaxSem! [18:56:00] Niharika, works ok? [18:56:13] (03CR) 10Andrew Bogott: [C: 032] diskspace.py: Add one more special-case flavor size. [puppet] - 10https://gerrit.wikimedia.org/r/357431 (owner: 10Andrew Bogott) [18:59:32] MaxSem: Seems so. [18:59:38] woot [19:00:00] lunch [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1900). [19:01:54] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: testwiki back to wmf.2 [19:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:01] (03PS1) 10Chad: group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357442 [19:08:53] !log demon@tin Synchronized README: No-op, just forcing co-master sync (duration: 01m 27s) [19:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:15] !log demon@tin Started scap: testwiki to wmf.4 + prepping l10n. 
again [19:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:49] 10Operations, 10ops-codfw, 10media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756#2511148 (10jcrespo) ``` $ cumin 'db20[33-70].*' 'hpssacli controller slot=0 show | grep -i firmware' 38 hosts will be targeted: db[2033-2070].codfw.wmnet Confirm to conti... [19:13:55] 10Operations, 10Monitoring: internal IPMI error - https://phabricator.wikimedia.org/T167121#3318229 (10jcrespo) It could be related to T141756#3320207 [19:17:40] (03CR) 10Ottomata: [C: 031] Delete unused role/common/analytics/hadoop configs [puppet] - 10https://gerrit.wikimedia.org/r/357418 (owner: 10Elukey) [19:20:40] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [19:23:47] !log demon@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [19:23:47] !log demon@tin scap failed: RuntimeError scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) (duration: 13m 32s) [19:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:07] Crap. [19:24:48] !log demon@tin Started scap: testwiki to wmf.4 + prepping l10n. again (x2) [19:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:11] (03PS3) 10Smalyshev: Enable archive indexing on delete for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357236 (https://phabricator.wikimedia.org/T162302) [19:34:50] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [19:35:00] PROBLEM - Nginx local proxy to apache on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:15] RainbowSprinkles ^^ mutante fixed it :) [19:36:15] !log cobalt - removed systemd unit file (that has issues with ulimit and isn't used yet) - ran "systemctl reset-failed" which cleared the "systemctl status" which made the Icinga check recover [19:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:49] RainbowSprinkles: yep, "systemctl reset-failed" is a thing, #systemd told me [19:36:50] RECOVERY - Nginx local proxy to apache on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.791 second response time [19:37:19] so i removed the unit file and did that, and cleared Icinga without any direct action on gerrit [19:37:25] * RainbowSprinkles sighs [19:38:13] The systemd file will be readded when we re do the upgrade (but with the fix :)) so we will be able to try starting gerrit with systemctl again. [19:42:48] 10Operations, 10Traffic, 10Interactive-Sprint, 10Maps (Kartographer), 10Regression: Map tiles load way slower than before - https://phabricator.wikimedia.org/T167046#3320381 (10debt) p:05Triage>03Normal [19:44:18] 10Operations, 10Interactive-Sprint, 10Maps (Tilerator): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#3320388 (10debt) Moving to prioritized as it's on our list of things that do need doing. 
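What mutante describes for cobalt below generalises to any host where the "Check systemd state" alert reports "degraded"; the unit name here is illustrative, not the actual gerrit unit that was removed:
```
systemctl --failed                       # list the unit(s) holding the system in "degraded" state
systemctl status example.service         # inspect the failed unit
systemctl reset-failed example.service   # clear the failed state for that one unit...
systemctl reset-failed                   # ...or for every failed unit at once
systemctl is-system-running              # should report "running" again, which lets the Icinga check recover
```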
[19:45:07] 10Operations, 10Discovery, 10Interactive-Sprint, 10Maps (Maps-data): Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#3320393 (10debt) Moving to prioritized as it's on our list of things that do need doing. [19:45:13] !log demon@tin Finished scap: testwiki to wmf.4 + prepping l10n. again (x2) (duration: 20m 25s) [19:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:07] (03CR) 10Chad: [C: 032] group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357442 (owner: 10Chad) [19:54:48] jouncebot: now [19:54:48] For the next 1 hour(s) and 5 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T1900) [19:55:44] (03Merged) 10jenkins-bot: group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357442 (owner: 10Chad) [19:57:58] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.4 [19:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:40] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3320503 (10Papaul) [20:12:00] (03CR) 10jenkins-bot: group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357442 (owner: 10Chad) [20:13:53] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2001 - https://phabricator.wikimedia.org/T167160#3320517 (10Papaul) @Robh @chasemp we have already a node with the name labtestneutron2001 in row B rack B8 can we make this labtestneutron2002? [20:16:04] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2001 - https://phabricator.wikimedia.org/T167160#3320521 (10chasemp) @papaul yes, thank you [20:16:07] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2001 - https://phabricator.wikimedia.org/T167160#3320522 (10RobH) @Papaul: good catch! Yes, lets just call this new host labtestneutron2002. [20:16:16] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3320523 (10RobH) [20:21:55] !log gerrit: Down for just a moment, finally doing point release on cobalt [20:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:06] 10Operations, 10MediaWiki-General-or-Unknown, 10Security-Team, 10Traffic, and 2 others: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3320680 (10Jdforrester-WMF) Mass-moving all items tagged for MediaWiki 1.30.0-wmf.3, as that was never released; ins... [20:45:30] (03PS1) 10Jdrewniak: Updating portals stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357490 (https://phabricator.wikimedia.org/T128546) [20:45:32] (03PS8) 10Dzahn: contint: Only install java 7 on trusty and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [20:51:24] (03CR) 10Hashar: [C: 04-1] "I am surprised by how fast the processing is done on my machine. 
The additional run is barely noticeable on my machine :]" (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: 10Volans) [20:53:06] (03CR) 10Dzahn: "thanks for clarifying that it is indeed intended to install both versions at the same time" [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [20:53:40] (03CR) 10Dzahn: [C: 032] contint: Only install java 7 on trusty and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [20:53:48] thanks mutante ^^ :) [20:53:59] 10Operations, 10Labs, 10cloud-services-team (Kanban): tools-k8s-master-01 has two floating IPs - https://phabricator.wikimedia.org/T164123#3320890 (10bd808) [20:54:05] 10Operations, 10Labs, 10cloud-services-team (Kanban): Initial OpenStack Neutron PoC deployment in Labtest - https://phabricator.wikimedia.org/T153099#3320892 (10bd808) [20:55:12] mutante: paladox: danke/thanks! [20:55:25] :) your welcome [20:56:01] 10Operations, 10Labs, 10cloud-services-team (Kanban): Investigate alternative RAID strategies for labstore1001/2 - https://phabricator.wikimedia.org/T162090#3320899 (10bd808) [20:56:03] 10Operations, 10Labs, 10cloud-services-team (Kanban): Undo special tools-home and tools-project share definitions for NFS - https://phabricator.wikimedia.org/T161834#3320900 (10bd808) [20:56:10] 10Operations, 10Labs, 10cloud-services-team (Kanban): labstore systemd state Icinga alarms - https://phabricator.wikimedia.org/T151322#3320902 (10bd808) [20:57:11] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3320910 (10herron) a:03herron [20:57:57] de rien [20:58:11] submitted it now (the bot never says that part) [20:59:56] (03PS2) 10Dzahn: contint: remove HHVM from Trusty permanent instances [puppet] - 10https://gerrit.wikimedia.org/r/357373 (owner: 10Hashar) [21:00:41] (03CR) 10Dzahn: [C: 032] "already cherry-picked" [puppet] - 10https://gerrit.wikimedia.org/r/357373 (owner: 10Hashar) [21:01:11] RainbowSprinkles: Looks like the Prefs page on Test Wikipedia is messed up (maybe related to train deployment): https://test.wikipedia.org/wiki/Special:Preferences [21:01:41] I saw someone complaining about this last week but couldn't repro [21:01:42] Hmmm [21:02:28] I think there's some JS not loading, but I don't see any JS errors in the console [21:03:01] Should I file a bug? [21:03:18] mutante: ready to submit :) [21:05:03] kaldari: Yeah file a bug... [21:05:18] I'd say a whole lot of JS isn't loading [21:05:32] The logo changes too from the cool version [21:05:40] hashar: submitted :) [21:05:46] \O/ [21:06:33] one more! [21:06:35] i see [21:06:46] (03PS14) 10Dzahn: contint: skip hhvm experimental pin on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: 10Paladox) [21:07:30] (03CR) 10Dzahn: [C: 032] "per comments above, also already cherry-picked" [puppet] - 10https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: 10Paladox) [21:07:43] thanks :) [21:09:50] paladox: trying to remember about the "nocanon" fix [21:10:12] "https://wiki.jenkins-ci.org/display/JENKINS/Running+Jenkins+behind+Apache states we should use nocanon in Apache ProxyPass" ah, yea [21:10:39] "Yep. We do that for gerrit too." 
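For context on change 357197 ("Tox: find and check Python files without extension") being reviewed above: one way to locate extension-less Python scripts is to look at their shebangs. A rough sketch, not the actual tox implementation:
```
# Files with no dot in their name whose first line is a python shebang, fed to flake8
find . -type f ! -name '*.*' -not -path './.git/*' \
  -exec sh -c 'head -n1 "$1" | grep -q "^#!.*python" && printf "%s\n" "$1"' sh {} \; \
  | xargs -r flake8
```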
[21:12:20] (03PS7) 10Dzahn: Jenkins: Add noncanon to jenkins proxy site [puppet] - 10https://gerrit.wikimedia.org/r/351391 (owner: 10Paladox) [21:12:40] Yep [21:13:11] (03CR) 10Dzahn: "jenkins docs: "Both the nocanon option to ProxyPass, and AllowEncodedSlashes NoDecode, are required for certain Jenkins features to work."" [puppet] - 10https://gerrit.wikimedia.org/r/351391 (owner: 10Paladox) [21:13:34] jynus: you there? [21:13:49] RainbowSprinkles: Created bug: https://phabricator.wikimedia.org/T167216 No idea who to subscribe to it though. [21:14:01] paladox: also quote from Apache mod_proxy docs .. added [21:14:23] (03CR) 10Dzahn: [C: 032] Jenkins: Add noncanon to jenkins proxy site [puppet] - 10https://gerrit.wikimedia.org/r/351391 (owner: 10Paladox) [21:14:50] kaldari: We could just start subscribing everyone at random until someone fixes it ;-) [21:15:03] good idea! [21:15:10] tgr: ping [21:15:13] * RainbowSprinkles writes a greasemonkey script called "subscribe-all-the-people" [21:15:26] (03PS1) 10Framawiki: Lift IP throttle for Wikipedia Editathon (June 16th 2017) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357510 (https://phabricator.wikimedia.org/T167201) [21:15:43] mutante thanks :) [21:15:50] kaldari: It's busted on mw.org too, I'm going to roll that back to wmf.2 for now [21:15:55] it needs a subscribe bot that finds the right people. like we have it for gerrit :) [21:16:10] thanks [21:16:30] (03PS1) 10Chad: Moving mediawiki.org back to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357511 [21:16:32] that special wiki page would just be "keywords -> people" [21:16:46] (03CR) 10Chad: [C: 032] Moving mediawiki.org back to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357511 (owner: 10Chad) [21:17:28] mutante: Oh, I wasn't looking for the right people. I was just going to subscribe people at random until someone fixes it :p [21:17:59] (03Merged) 10jenkins-bot: Moving mediawiki.org back to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357511 (owner: 10Chad) [21:18:09] (03CR) 10jenkins-bot: Moving mediawiki.org back to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357511 (owner: 10Chad) [21:18:29] RainbowSprinkles: heheee, yea, just need to disable notifications for "unsubscribe" action [21:19:24] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: unbreak mw.org pref page [21:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:38] (03PS2) 10Dzahn: [Planet Wikimedia] Add blog.wikimedia.gr to Greek Planet [puppet] - 10https://gerrit.wikimedia.org/r/357254 (owner: 10Nemo bis) [21:22:00] (03PS1) 10Andrew Bogott: designate.conf: Update the keystone_authtoken section [puppet] - 10https://gerrit.wikimedia.org/r/357512 [21:22:24] kaldari: Definitely isolated to wmf.2 -> wmf.4 jump. Rolling mw.org back fixed it [21:22:50] (03CR) 10Dzahn: "https works here, let's use it wherever possible, amending" [puppet] - 10https://gerrit.wikimedia.org/r/357254 (owner: 10Nemo bis) [21:23:50] RainbowSprinkles: Beta Cluster has the same issue [21:23:52] https://simple.wikipedia.beta.wmflabs.org/wiki/Special:Preferences [21:24:30] So nobody's fixed it yet in master, ok. [21:25:02] Guess I'll add it to the deployment blockers [21:25:20] Yeah please. I'll look at this some more in a bit, gotta run to the post office [21:29:31] twentyafterfour: are you about? 
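For reference, the two Apache directives the nocanon change above is about (as quoted from the Jenkins docs in the log); the proxied path and backend port are placeholders, and the snippet is written to a scratch file purely for illustration:
```
cat <<'EOF' > /tmp/jenkins-proxy-snippet.conf
# Keep encoded slashes intact instead of rejecting or decoding them
AllowEncodedSlashes NoDecode
# "nocanon" stops mod_proxy from re-canonicalising the URL before passing it to Jenkins
ProxyPass        /ci http://127.0.0.1:8080/ci nocanon
ProxyPassReverse /ci http://127.0.0.1:8080/ci
EOF
```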
[21:30:42] (03PS3) 10Dzahn: [Planet Wikimedia] Add blog.wikimedia.gr to Greek Planet [puppet] - 10https://gerrit.wikimedia.org/r/357254 (owner: 10Nemo bis) [21:35:50] (03CR) 10Dzahn: [C: 032] [Planet Wikimedia] Add blog.wikimedia.gr to Greek Planet [puppet] - 10https://gerrit.wikimedia.org/r/357254 (owner: 10Nemo bis) [21:38:30] TabbyCat: o/ [21:41:53] !log contint1001 - graceful'ed Apache to deploy gerrit:351391 [21:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:20] tgr: can you check https://phabricator.wikimedia.org/T167219 ? [21:46:10] (03PS5) 10Dzahn: flake8: upgrade to 3.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/356236 (owner: 10BryanDavis) [21:47:26] TabbyCat: do you have that log open? [21:47:53] tgr: nope :( [21:49:17] tgr: I have IP/CIDR if it helps [21:50:53] thx, found it [21:51:42] okay, I'll be around for a couple of minutes, if you need something let me know tgr and I'll see if I can help [21:53:34] !log gerrit: restarting to test a config tweak [21:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:03] 10Operations, 10HHVM: Upload hhvm to stretch apt repo in apt.wikimedia.org - https://phabricator.wikimedia.org/T167225#3321280 (10Paladox) [21:55:31] RainbowSprinkles what config are you testing? :) [21:56:35] doesn't matter, didn't work [21:56:50] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:57:00] !log gerrit: restarting last time, didn't work like I wanted [21:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:05] (03CR) 10Volans: "> I am surprised by how fast the processing is done on my machine." (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: 10Volans) [22:00:23] (03PS4) 10Volans: Tox: find and check Python files without extension [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) [22:00:52] RainbowSprinkles we can close https://phabricator.wikimedia.org/T158946 as resolved now? [22:01:45] thanks :) [22:04:40] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2038941 [22:05:00] 10Operations, 10Gerrit, 10Beta-Cluster-reproducible, 10Patch-For-Review, and 2 others: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#3321381 (10demon) 05Open>03Resolved This shouldn't actually be a problem anymore. [22:09:46] (03CR) 10Bearloga: Add Shiny Server module and Discovery Dashboards role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: 10Bearloga) [22:10:59] (03PS1) 10Eevans: WIP: throwing things against Puppet Compiler to see what sticks [puppet] - 10https://gerrit.wikimedia.org/r/357515 (https://phabricator.wikimedia.org/T167222) [22:13:52] (03CR) 10Dzahn: [C: 032] flake8: upgrade to 3.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/356236 (owner: 10BryanDavis) [22:15:57] 10Operations, 10Gerrit, 10Beta-Cluster-reproducible, 10Release-Engineering-Team (Kanban), 10Upstream: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#3321443 (10Paladox) [22:17:44] (03CR) 10Chad: "Ignore this comment, posting for an example task." 
(031 comment) [software/gerrit] - 10https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [22:18:01] is it not possible to use puppet compiler for deployment-prep in labs? [22:20:08] (03CR) 10Paladox: [C: 031] Add core + core plugins @ 2.13.8 [software/gerrit] - 10https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [22:20:16] (03CR) 10Paladox: [C: 031] Configuring git-fat to work with Archiva [software/gerrit] - 10https://gerrit.wikimedia.org/r/356482 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [22:20:19] (03CR) 10Paladox: [C: 031] Adding scap3 config [software/gerrit] - 10https://gerrit.wikimedia.org/r/356484 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [22:20:35] I reviewed those a while ago ^^ i am only adding +1 now :) [22:23:50] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [22:25:21] mutante: ping? [22:26:02] (03CR) 10Dzahn: "feel free to re-add me if any changes" [puppet] - 10https://gerrit.wikimedia.org/r/145018 (owner: 10ArielGlenn) [22:28:51] (03PS3) 10Bearloga: Add Shiny Server module and Discovery Dashboards role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) [22:31:12] 10Operations, 10Labs, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Set up external DNS record for wikitech-static - https://phabricator.wikimedia.org/T164290#3321486 (10RobH) [22:32:18] urandom: not that i'd know of. it looks at site.pp to find the nodes and there is none in labs [22:32:37] could be that i just dont know how though [22:32:58] saw https://wikitech.wikimedia.org/wiki/Puppet_migration#Puppet_Catalogs_compiler that is about setting it up in vagrant [22:33:05] mutante: oh, yeah. i was going to ask how familiar you are with the cassandra puppetization since _joe_ refactored it [22:33:17] ah, i was trying to answer the compiler question [22:33:34] i am not familiar with the cassandra puppetization in particular [22:33:36] yeah, i think i came to the same conclusion :( [22:33:55] did you have a particular issue ? [22:34:08] yeah, in deployment prep we have two cassandra clusters that are crossed [22:34:11] they've... merged [22:34:25] because one of them got some seeds mixed in from the other [22:34:42] which i'm guessing is the result of inheritance [22:35:03] but everything has sort of changed here [22:35:30] i _think_ he would try to avoid inheritance [22:35:31] the PC question was because i was going to iterate on some educated guesses :) [22:35:38] what are the names of the crossed ones? [22:36:25] 10Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Remove deprecated features from book creator UI - https://phabricator.wikimedia.org/T150917#3321498 (10Jdlrobson) [22:36:27] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Reading-Web-Backlog, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3321497 (10Jdlrobson) [22:36:32] it's the restbase cluster (consisting of restbase01 and restbase02), and the aqs one (consisting of aqs0[1-3]) [22:37:29] mutante: i was looking at hieradata/labs/deployment-prep/common.yaml, and profile::cassandra::instances lists the two restbase nodes [22:37:47] ok, and what are the names of the roles that they are using [22:38:03] hrmm, restbase and aqs, i think [22:38:27] hmm. 
then it would be all different from production [22:39:35] so the profile should be used inside a role [22:39:46] and the role should be on the instance [22:40:33] and what is in hieradata/labs/deployment-prep/common.yaml would probably be applied to the whole deployment-prep project then [22:40:46] that would mean it's not based on the role [22:41:22] so that could explain why you see things from there on all the instances [22:42:05] in prod, if something is in hieradata/role/common/ it gets applied on all nodes using that role [22:42:08] yeah, the restbase nodes don't have the aqs nodes in their seeds list, but the aqs nodes have the other aqs nodes and the restbase nodes [22:42:16] unfortunately hieradata/labs follows a different approach [22:42:35] yeah, that was confusing me [22:43:00] i agree [22:43:14] the best fix would be to make it more similar i think [22:43:35] hiera lookup based on role instead of project, and then apply roles on individual instances [22:43:48] (03Abandoned) 10Eevans: WIP: throwing things against Puppet Compiler to see what sticks [puppet] - 10https://gerrit.wikimedia.org/r/357515 (https://phabricator.wikimedia.org/T167222) (owner: 10Eevans) [22:44:07] in horizon you can also do either or, apply a puppet role by instance, by project or even by prefix of the hostname [22:45:02] but that feels like a bigger thing to restructure the whole deployment-prep setup [22:46:34] 6 [22:48:32] (03CR) 10Faidon Liambotis: [C: 04-1] aptrepo: add hp-mcp-stretch and thirdparty/hwraid (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/357422 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [22:53:40] (03PS4) 10Paladox: Gerrit: Reveal the author in the title of the email [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) [22:54:32] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3321579 (10Dzahn) [22:55:20] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3301040 (10Dzahn) 05Open>03Resolved great! thank you. i have removed the network access part from the onboarding. that means all subtasks are resolved and closing this. [22:58:09] 10Operations, 10HHVM: Upload hhvm to stretch apt repo in apt.wikimedia.org - https://phabricator.wikimedia.org/T167225#3321589 (10Dzahn) this would be like T125821 was for jessie [22:58:46] 10Operations, 10Continuous-Integration-Infrastructure: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3321592 (10faidon) Well, first of all, right before I filed this task, Antoine said on IRC: > containers for CI would be for later. The priority has been set toward sta... [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170606T2300). [23:00:04] Jdlrobson, Smalyshev, and matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
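To illustrate the hiera layout difference described above: a key defined in hieradata/labs/deployment-prep/common.yaml is resolved by every instance in the deployment-prep project regardless of role, while in production a key under hieradata/role/common/ is only resolved by nodes applying that role. That project-wide scoping is how aqs instances can end up picking up restbase Cassandra seeds. The sketch below follows that layout, but the key name and instance names are simplified placeholders, not the actual deployment-prep values.

```
# hieradata/labs/deployment-prep/common.yaml
# Project-scoped: every deployment-prep instance resolves this key,
# including the aqs hosts, so their Cassandra config can inherit
# restbase seeds by accident.
profile::cassandra::seeds:
  - deployment-restbase01.deployment-prep.eqiad.wmflabs
  - deployment-restbase02.deployment-prep.eqiad.wmflabs

# hieradata/role/common/aqs.yaml  (production-style, role-scoped)
# Only nodes that include the aqs role resolve this key, keeping the
# two clusters separate.
profile::cassandra::seeds:
  - aqs-node-a.example.eqiad.wmflabs
  - aqs-node-b.example.eqiad.wmflabs
```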
[23:00:17] \o [23:00:23] Present [23:03:10] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:07:31] I can SWAT [23:08:10] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 15 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:08:17] (03PS2) 10Thcipriani: Update ContentNamespaces for Commons Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357426 (https://phabricator.wikimedia.org/T167077) (owner: 10Jdlrobson) [23:08:25] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357426 (https://phabricator.wikimedia.org/T167077) (owner: 10Jdlrobson) [23:09:39] (03Merged) 10jenkins-bot: Update ContentNamespaces for Commons Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357426 (https://phabricator.wikimedia.org/T167077) (owner: 10Jdlrobson) [23:09:48] (03CR) 10jenkins-bot: Update ContentNamespaces for Commons Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357426 (https://phabricator.wikimedia.org/T167077) (owner: 10Jdlrobson) [23:10:08] (03CR) 10Dzahn: "alright, got back to this one and had to remember myself what we said here. so if the package does provide the traditional sysvinit init s" [puppet] - 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [23:10:15] Reedy: around? [23:10:34] jdlrobson: contentnamespaces patch is live on mwdebug1002, check please [23:10:39] on it! [23:10:59] it works thcipriani [23:11:02] sync away [23:11:06] * thcipriani does [23:12:41] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:357426|Update ContentNamespaces for Commons Wiki]] T167077 (duration: 00m 46s) [23:12:47] ^ jdlrobson live now [23:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:53] T167077: Use wgContentNamespaces instead of $wgMFContentNamespace - https://phabricator.wikimedia.org/T167077 [23:13:01] (03PS2) 10Thcipriani: Disable page previews on wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357425 (https://phabricator.wikimedia.org/T166894) (owner: 10Jdlrobson) [23:13:03] looks good! 
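For reference on the ContentNamespaces patch just synced: per-wiki overrides in wmf-config/InitialiseSettings.php map a setting name to an array keyed by wiki database name, and a '+' prefix on the wiki name merges that entry with the default instead of replacing it. The fragment below shows only the shape of such an override; the namespace IDs are placeholders, not the values gerrit:357426 actually deployed for commonswiki.

```
// Excerpt shape from wmf-config/InitialiseSettings.php (illustrative values only)
'wgContentNamespaces' => [
	'default' => [ NS_MAIN ],
	// '+' means: merge with 'default' rather than replace it.
	'+commonswiki' => [ NS_FILE, 100 ], // 100 is a hypothetical extra namespace
],
```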
[23:13:07] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357425 (https://phabricator.wikimedia.org/T166894) (owner: 10Jdlrobson) [23:14:24] (03Merged) 10jenkins-bot: Disable page previews on wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357425 (https://phabricator.wikimedia.org/T166894) (owner: 10Jdlrobson) [23:15:13] jdlrobson: ^ is live on mwdebug1002, check please [23:15:19] on it [23:15:27] SMalyshev: ping for SWAT [23:15:41] (03CR) 10jenkins-bot: Disable page previews on wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357425 (https://phabricator.wikimedia.org/T166894) (owner: 10Jdlrobson) [23:15:42] thcipriani: that's also good [23:15:48] * thcipriani syncs [23:17:17] (03CR) 10Dzahn: [C: 04-1] "i'll turn -2 into -1 for now, it's a good point but we gotta make sure there is no difference between the 2 files" [puppet] - 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [23:17:17] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:357425|Disable page previews on wikispecies]] T166894 (duration: 00m 44s) [23:17:24] ^ jdlrobson live now [23:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:25] T166894: Disable page previews on wikispecies - https://phabricator.wikimedia.org/T166894 [23:17:52] thcipriani: all good! [23:18:00] cool, thanks for checking :) [23:18:25] blerg, should have started merging these flow patches sooner :\ [23:24:22] https://test2.wikipedia.org/wiki/Main_Page - 500 - /php-1.30.0-wmf.4/includes/parser/Parser.php: Tag hook for noexternallanglinks is not callable [23:25:32] Wikidata breaking things? [23:26:42] thcipriani: ^ [23:27:13] oh good. [23:29:01] Filed a task just now [23:29:18] subtask of wmf.4 blockers [23:29:28] https://phabricator.wikimedia.org/T167238 [23:29:31] Krinkle: thanks, RainbowSprinkles ^ FYI [23:30:23] I think we're just on testwikis with wmf.4 so I may leave it for folks to investigate. MediaWiki is on wmf.2 [23:30:36] Krinkle: I already rolled mw.org back to wmf.2 because of T167216 [23:30:36] T167216: Preferences page messed up on Test Wikipedia (1.30.0-wmf.4) - https://phabricator.wikimedia.org/T167216 [23:30:45] (so only running on test(2) and friends) [23:30:54] OK [23:33:22] matt_flaschen: hrrrm, jenkins didn't like your patches for SWAT for some reason :\ [23:34:28] oh, composer + https://status.github.com/ [23:35:08] thcipriani, yeah, just checked, both are showing that. [23:35:24] 10Operations, 10DNS, 10Traffic: Redirect status.wikipedia.org to status.wikimedia.org - https://phabricator.wikimedia.org/T167239#3321697 (10Ladsgroup) [23:37:35] thcipriani, I re-did the gate, but I'm not crossing my fingers. This is a bad bug, but it's not a new bug, so not sure. I lean towards waiting until it can gate normally. ^ RoanKattouw [23:39:03] thcipriani, it says it's recovering now: https://status.github.com/ [23:39:05] 19:38 EDT [23:39:06] Our systems are recovering from the interruption of one of our core data services. [23:39:48] 16:24:21 https://test2.wikipedia.org/wiki/Main_Page - 500 - /php-1.30.0-wmf.4/includes/parser/Parser.php: Tag hook for noexternallanglinks is not callable [23:39:48] 16:25:31 Wikidata breaking things? [23:39:56] matt_flaschen: Is that related to your/our Wikidata change for RCF? 
---^^ [23:40:17] I think ours broke noexternallanglinks, not to say that it must be our change that broke it now, but it is suspicious [23:40:26] Maybe the Wikidata people refactored it and broke it, but who knows [23:43:01] RoanKattouw, I was also suspicious and wondering about that. Our patch was almost 3 months ago so I put it back down (thinking it probably wasn't broken that long), but I'll check for sure (it's probably not a widely used magic word, and maybe someone just recently added it to test2) [23:43:13] I can probably track it down now regardless. [23:46:51] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [23:52:20] matt_flaschen: changes for flow for wmf.2 and wmf.4 live on mwdebug1002, check please [23:55:03] !log gerrit: force stopping for a second to reindex accounts [23:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:10] thcipriani, works, good to go for everywhere. [23:56:16] matt_flaschen: ok, going live [23:56:17] gerrit acts like it's down [23:56:22] Posted a test topic at https://gom.wikipedia.org/wiki/%E0%A4%B5%E0%A4%BE%E0%A4%AA%E0%A4%B0%E0%A4%AA%E0%A5%80_%E0%A4%9A%E0%A4%B0%E0%A5%8D%E0%A4%9A%E0%A4%BE:STACEY_MESQUITA then hid it. [23:56:23] !log gerrit: back from reindexing [23:56:26] nope [23:56:29] false alarm [23:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:32] Amir1: See SAL, on purpose for 2 seconds :) [23:56:33] Bad timing [23:56:52] It seems I always am [23:57:09] :P [23:58:23] !log thcipriani@tin Synchronized php-1.30.0-wmf.4/extensions/Flow/includes/Content/BoardContentHandler.php: SWAT: [[gerrit:357501|Revert "Throw when unserializing invalid Flow workflow metadata JSON"]] T166100 T156813 (duration: 00m 45s) [23:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:32] T166100: MWContentSerializationException: Failed to decode blob. It should be JSON representing valid Flow metadata. - https://phabricator.wikimedia.org/T166100 [23:58:32] T156813: MWContentSerializationException in Konkani Wikipedia (gomwiki) - https://phabricator.wikimedia.org/T156813 [23:59:10] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [23:59:43] !log thcipriani@tin Synchronized php-1.30.0-wmf.2/extensions/Flow/includes/Content/BoardContentHandler.php: SWAT: [[gerrit:357500|Revert "Throw when unserializing invalid Flow workflow metadata JSON"]] T166100 T156813 (duration: 00m 43s) [23:59:49] ^ matt_flaschen live everywhere [23:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
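A closing note on the noexternallanglinks 500 filed above as T167238: MediaWiki's Parser.php raises "Tag hook for X is not callable" when a tag registered through Parser::setHook() points at a callback that no longer resolves, for example after a class or method is renamed in a refactor. The sketch below shows the generic registration pattern only; the handler class is invented for illustration and is not the actual Wikibase code.

```
<?php
// Generic parser tag hook wiring (illustrative only).
$wgHooks['ParserFirstCallInit'][] = function ( Parser $parser ) {
	// If ExampleLangLinkHandler::render stopped existing (e.g. the class
	// was renamed), pages using the tag would fail with
	// "Tag hook for noexternallanglinks is not callable".
	$parser->setHook( 'noexternallanglinks', [ 'ExampleLangLinkHandler', 'render' ] );
	return true;
};

class ExampleLangLinkHandler {
	public static function render( $input, array $args, Parser $parser, PPFrame $frame ) {
		// A real handler would record which interwiki language links to
		// suppress; this placeholder just produces no output.
		return '';
	}
}
```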