[00:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170427T0000). [00:00:59] Niharika: cirrussearch still needs to be synced out with the submodule update, I can do that if needed [00:05:19] !log ebernhardson@naos Synchronized php-1.29.0-wmf.21/extensions/CirrusSearch/includes/Searcher.php: cirrus: align sister search boost template config variable with documentation (duration: 00m 50s) [00:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:35] (03PS1) 10Jgreen: switch fundraisingdb-wmnet [dns] - 10https://gerrit.wikimedia.org/r/350502 [00:14:31] !log starting phabricator update [00:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:06] (03CR) 10Jgreen: [C: 032] switch fundraisingdb-wmnet [dns] - 10https://gerrit.wikimedia.org/r/350502 (owner: 10Jgreen) [00:23:00] ebernhardson: Did you say "`git log` in extensions/CirrusSearch shows it's updated now, and it no longer displays in `git status`" [00:23:15] I did do a submodule update for it. [00:25:35] PROBLEM - PHD should be running on iridium is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 997 (phd) [00:25:49] * twentyafterfour is on it [00:26:15] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) [00:26:21] ACKNOWLEDGEMENT - PHD should be running on iridium is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 997 (phd) 20after4 maintenance [00:26:24] ACKNOWLEDGEMENT - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) 20after4 maintenance [00:26:40] apologies I forgot to silence icinga before maintenance [00:27:45] RECOVERY - PHD should be running on iridium is OK: PROCS OK: 1 process with regex args php ./phd-daemon, UID = 997 (phd) [00:28:15] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 21 processes with UID = 997 (phd) [00:28:40] (03PS1) 10Eevans: Update collector version to 3.1.4 [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/350503 (https://phabricator.wikimedia.org/T163936) [00:28:52] * twentyafterfour wonders why we have 2 alerts for phd [00:36:16] (03CR) 10Eevans: [C: 031] "Note: This deploys the 3.1.4 jar, once it is in place everywhere, a subsequent Puppet changeset will be needed to link it." [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/350503 (https://phabricator.wikimedia.org/T163936) (owner: 10Eevans) [00:45:00] 06Operations: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#3216445 (10faidon) FWIW, besides T102099 which is indeed a duplicate or parent, there is also the much newer T163196 which is actually very close to being resolved now. It would solve all of the above consider...
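The submodule exchange above is the usual pre-sync verification. A minimal sketch of the checks being described, assuming the standard staging path on the deployment host (paths and flags are illustrative):

```
# Verify the CirrusSearch submodule bump before syncing it out.
cd /srv/mediawiki-staging/php-1.29.0-wmf.21
git submodule status extensions/CirrusSearch   # a leading '+' means the checkout drifted from the recorded commit
git status --short extensions/CirrusSearch     # prints nothing once the update is committed
git log -1 --oneline extensions/CirrusSearch   # latest commit touching the submodule pointer
```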
[01:08:33] (03CR) 10Krinkle: Use EtcdConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [01:09:09] (03PS1) 10Jforrester: Decom legacy citoid service hostname [puppet] - 10https://gerrit.wikimedia.org/r/350505 (https://phabricator.wikimedia.org/T133001) [01:11:18] (03CR) 10Krinkle: Use EtcdConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [01:11:32] (03CR) 10Jforrester: [C: 04-1] "Not before 1 May. Also, I have no real idea what I'm doing here." [puppet] - 10https://gerrit.wikimedia.org/r/350505 (https://phabricator.wikimedia.org/T133001) (owner: 10Jforrester) [02:07:39] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [02:08:38] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [02:16:42] !log Reset 2FA for T163931 on labswiki [02:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:44] !log running populateEditCount.php in screen on wasat for T163854, counting edits for board vote eligibility [02:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:55] T163854: Create voter lists for Board & FDC Elections 2017 - https://phabricator.wikimedia.org/T163854 [02:37:26] (03Abandoned) 10Jdlrobson: Disable RelatedSites on English, French and Italian Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335830 (https://phabricator.wikimedia.org/T128326) (owner: 10Jdlrobson) [03:06:48] PROBLEM - Disk space on ms-be1039 is CRITICAL: DISK CRITICAL - free space: / 1855 MB (3% inode=86%) [03:40:46] !log running kafka replica election to bring kafka1018 back as preferred leader [03:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:01] !log starting kafka broker on kafka1020 [03:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:36] 06Operations, 10ops-eqiad, 10Phabricator: phab1001 hdd port a failure - https://phabricator.wikimedia.org/T163960#3216603 (10Dzahn) [04:12:36] (03CR) 10Dzahn: "can you add some more details what this does / a link to some docs for this option please" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [04:17:36] (03CR) 10Krinkle: "Added unit tests - for your review at I6e647985b98724."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [05:03:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/1/3: down - Core: cr2-eqiad:xe-4/1/3 (Level3, BDFS2448, 84ms) {#A0010621} [10Gbps wave] [05:03:38] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/1/3: down - Core: cr2-esams:xe-0/1/3 (Level3, BDFS2448, 84ms) {#2013} [10Gbps wave] [05:21:38] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 [05:21:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [05:24:30] !log cleaning some rows in ores_classification in enwiki (T159753) [05:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:41] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [05:34:38] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [05:37:08] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] [05:37:58] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89981.19 seconds [05:41:18] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [05:42:48] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=658.70 Read Requests/Sec=3208.70 Write Requests/Sec=9.00 KBytes Read/Sec=13543.20 KBytes_Written/Sec=87.60 [05:43:18] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] [05:52:48] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=46.60 Read Requests/Sec=4.60 Write Requests/Sec=71.70 KBytes Read/Sec=32.80 KBytes_Written/Sec=561.60 [05:59:54] !log Deploy alter table labsdb1003 (wikidatawiki) https://phabricator.wikimedia.org/T162539 https://phabricator.wikimedia.org/T163548 [06:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:20] (03CR) 10Chad: [C: 04-1] "We should set this to HTTP, not LDAP.
I want to keep using the randomly-generated passwords, not have people use their LDAP passwords here" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [06:08:59] !log Deploy alter table on s5 (wikidatawiki) on db1049 - T130067 T162539 [06:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:10] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539 [06:09:10] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067 [06:11:00] !log Deploy alter table on s5 (wikidatawiki) on db1070 (running locally instead of neodymium as this host will be affected by the network maintenance) - T130067 T162539 [06:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:33] !log Deploy alter table on s5 (wikidatawiki) on db1070 (running locally instead of neodymium as this host will be affected by the network maintenance) - T163548 [06:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:44] !log Deploy alter table on s5 (wikidatawiki) on db1049 - T163548 [06:14:44] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548 [06:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:27] !log Deploy schema change on s7 metawiki.pagelinks to remove partitioning on db1041 - T153300 [06:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:37] T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300 [06:23:35] <_joe_> !log moving orphaned objects in ms-be1039's root partition in sdc1/stale_root to save space [06:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:29] 06Operations, 10DBA: Drop database table "hashs" from Wikimedia wikis - https://phabricator.wikimedia.org/T54927#3216721 (10Marostegui) 05Open>03Resolved a:03Marostegui This has been dropped from the random places where it existed (it had 0 rows everywhere): s2: bgwiktionary enwikiquote enwiktionary s3... 
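The s5 alters logged through 06:14 are tracked in T162539 (add term_full_entity_id to wb_terms) and T163548 (drop two redundant wb_terms keys). A hedged sketch of their general shape only; the authoritative DDL lives in those tasks, and the column type here is an assumption:

```
# Illustrative only -- run per host, as the !log entries describe.
mysql -h db1049.eqiad.wmnet wikidatawiki -e \
  "ALTER TABLE wb_terms ADD COLUMN term_full_entity_id VARBINARY(32) DEFAULT NULL;"
mysql -h db1049.eqiad.wmnet wikidatawiki -e \
  "ALTER TABLE wb_terms DROP KEY wb_terms_entity_type, DROP KEY wb_terms_type;"
```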
[06:39:40] !log Logging for the record: drop table hashs from s2, s3 and s7 (only places where it existed) - T54927 [06:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:49] T54927: Drop database table "hashs" from Wikimedia wikis - https://phabricator.wikimedia.org/T54927 [06:44:28] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 23 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [06:45:44] !log Reboot es1011 for kernel upgrade - T162029 [06:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:51] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [06:47:08] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 430 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [06:49:28] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 15 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [06:50:20] !log executed kafka preferred-replica-election to rebalance topic leaders in the analytics cluster after maintenance [06:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:09] 06Operations: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029#3150214 (10Marostegui) es1011 has been rebooted: ``` root@es1011:~# uname -r 4.9.0-0.bpo.2-amd64 ``` [06:51:58] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 430 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [06:53:39] !log Reboot es1014 for kernel upgrade - T162029 [06:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:47] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [07:00:42] 06Operations: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029#3216740 (10Marostegui) es1014 has been rebooted: ``` root@es1014:~# uname -r 4.9.0-0.bpo.2-amd64 ``` [07:06:50] 06Operations, 10netops: Interface errors on cr2-eqiad:xe-4/3/1 - https://phabricator.wikimedia.org/T163542#3216758 (10ayounsi) p:05Triage>03Lowest [07:11:03] (03PS1) 10Marostegui: db-eqiad.php: Repool hosts out for net maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350518 (https://phabricator.wikimedia.org/T162681) [07:12:28] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool hosts out for net maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350518 (https://phabricator.wikimedia.org/T162681) (owner: 10Marostegui) [07:13:28] (03Merged) 10jenkins-bot: db-eqiad.php: Repool hosts out for net maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350518 (https://phabricator.wikimedia.org/T162681) (owner: 10Marostegui) [07:13:40] (03CR) 10jenkins-bot: db-eqiad.php: Repool hosts out for net maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350518 (https://phabricator.wikimedia.org/T162681) (owner: 10Marostegui) [07:15:48] RECOVERY - Disk space on ms-be1039 is OK: DISK OK [07:16:31] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Repool hosts that needed to be moved for the network maintenance - T162681 (duration: 02m 32s) [07:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:41] T162681: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681 [07:19:48] PROBLEM - Disk space 
on ms-be1039 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=85%): /tmp/rootmirror 0 MB (0% inode=85%) [07:22:45] 06Operations, 13Patch-For-Review: Logrotate fails on mediawiki maintenance servers on jessie - https://phabricator.wikimedia.org/T163555#3216772 (10MoritzMuehlenhoff) 05Open>03Resolved This is fixed after merging the patches listed in this bug, last night's logrotate on wasat was free of errors. [07:22:46] 06Operations, 13Patch-For-Review, 15User-Elukey: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3216774 (10MoritzMuehlenhoff) [07:27:48] RECOVERY - Disk space on ms-be1039 is OK: DISK OK [07:42:53] (03PS1) 10Ema: 4.1.5-1wm3: add exp_thread_rt and exp_lck_inherit [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/350520 (https://phabricator.wikimedia.org/T145661) [07:43:50] jynus: added: https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0April.C2.A027 Does it sound good? [07:44:52] I'm running away for lunch and will be back in one hour to start the clean up [07:45:45] <_joe_> Amir1: is ores working from eqiad right now? [07:46:00] _joe_: yeah, since last night (deployment) [07:46:07] https://wikitech.wikimedia.org/wiki/Incident_documentation/20170426-ORES [07:46:26] My plan for today includes looking at this and trying to get it resolved [07:47:25] <_joe_> so there is an ongoing outage on ORES, you're telling me [07:47:37] no [07:47:48] I asked to put on the deployment calendar [07:47:58] long-running database maintenance [07:48:07] as it is releng policy [07:48:35] <_joe_> jynus: why "no"? [07:48:39] _joe_: yeah, that wasn't the solution in my opinion either [07:48:41] <_joe_> ores in codfw is unusable [07:48:54] ok, I meant [07:48:56] jynus: I added the window as you said. [07:48:57] <_joe_> oh the deployment :P [07:49:11] that that is what he was saying [07:49:25] if there is an outage ongoing, I do not know [07:49:26] "09:00–11:00 UTC # 02:00–04:00 PDT 13:30–15:30 UTC+4.5 ORES tables cleanup Amir Sarabadani (Amir1) Reduce ores_classification table size on enwiki (removing 80M rows)" [07:49:29] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3216866 (10ayounsi) [07:49:50] (03CR) 10Ema: [V: 032 C: 032] 4.1.5-1wm3: add exp_thread_rt and exp_lck_inherit [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/350520 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [07:50:06] These are two completely different topics, one is about cleaning up the ores_classification table.
One is about https://wikitech.wikimedia.org/wiki/Incident_documentation/20170426-ORES [07:55:01] !log varnish 4.1.5-1wm3 uploaded to apt.w.o T145661 [07:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:10] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661 [07:56:50] !log aqs100[69] back serving AQS traffic [07:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:36] (03PS1) 10Ema: cache: hiera flag to enable expiry thread RT experiment [puppet] - 10https://gerrit.wikimedia.org/r/350526 (https://phabricator.wikimedia.org/T145661) [08:14:35] (03PS2) 10Ema: cache: hiera flag to enable expiry thread RT experiment [puppet] - 10https://gerrit.wikimedia.org/r/350526 (https://phabricator.wikimedia.org/T145661) [08:19:06] !log upgrade varnish to 4.1.5-1wm3 on cp2024 [08:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:38] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/3/2: down - Transit: NTT (service ID 234630) {#3475} [10Gbps] [08:21:48] (03CR) 10Ema: [V: 032 C: 032] cache: hiera flag to enable expiry thread RT experiment [puppet] - 10https://gerrit.wikimedia.org/r/350526 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [08:25:19] !log restart varnish-be on cp2024 with expiry thread RT experiment enabled [08:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:55] 06Operations: Ops pwstore currently read-only - https://phabricator.wikimedia.org/T163942#3216935 (10MoritzMuehlenhoff) That's caused by Alex's key, which expired two days ago: pub rsa4096 2015-04-27 [SC] [expired: 2017-04-25] 5FF346D51268D1468A070853AB640DA3D40B305A uid [ expired] Alexandros... [08:29:34] !log ms-be1039 issue "controller slot=3 pd 1I:1:5 modify disablepd" to force failed sdc - T163690 [08:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:41] T163690: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T163690 [08:32:38] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [08:47:38] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [08:49:38] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [08:50:36] !log installing django security updates [08:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:04] <_joe_> !log restarting redis rdb1001:6380 after cleaning up the current AOF files for investigation of T163337 [08:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:12] T163337: Watchlist entries duplicated several times - https://phabricator.wikimedia.org/T163337 [08:55:09] !log deploying alter table to all wikis on s6 T163979 [08:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:17] T163979: Convert unique keys into primary keys for some wiki tables on s6 - https://phabricator.wikimedia.org/T163979 [09:00:04] Amir1: Dear anthropoid, the time has come.
Please deploy ORES tables cleanup (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170427T0900). [09:03:21] on it [09:25:37] <_joe_> !log restarting all redis instances for jobqueues on eqiad to force a full resync with masters in codfw T163337 [09:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:48] T163337: Watchlist entries duplicated several times - https://phabricator.wikimedia.org/T163337 [09:36:12] (03PS3) 10Volans: Traffic: add automatic verification of the changes [switchdc] - 10https://gerrit.wikimedia.org/r/349879 (https://phabricator.wikimedia.org/T163373) [09:46:28] !log upgrading mysql on bohrium/piwik [09:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:08] PROBLEM - MariaDB Slave SQL: s6 on db1093 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1068, Errmsg: Error Multiple primary key defined on query. Default database: frwiki. [Query snipped] [10:17:33] ^ expected [10:27:18] PROBLEM - puppet last run on ms-be1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 28 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdc] [10:28:08] RECOVERY - MariaDB Slave SQL: s6 on db1093 is OK: OK slave_sql_state Slave_SQL_Running: Yes [10:29:28] PROBLEM - HP RAID on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [10:42:46] 06Operations: Ops pwstore currently read-only - https://phabricator.wikimedia.org/T163942#3217140 (10akosiaris) Done. I 've just extended it another 2 years and published it. It seems like it will take a while for all the keyservers to be updated. Right now https://sks-keyservers.net/pks/lookup?op=vindex&search=... [10:51:58] 06Operations, 10Ops-Access-Requests, 10Deployment-Systems: Enable keyholder for ORES deployments - https://phabricator.wikimedia.org/T163939#3215512 (10akosiaris) No, a new "identity" is not required. ORES reuses `deploy-service` as seen in https://github.com/wikimedia/puppet/blob/production/modules/service/... 
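The ORES window pinged at 09:00 covers removing roughly 80M rows from enwiki's ores_classification table (T159753). A delete of that size is normally batched so replication can keep up; a hedged sketch of the pattern only, with host, predicate, and batch size all assumed rather than taken from the actual cleanup script:

```
# Batched-delete pattern; '42' stands in for a superseded model id,
# oc_model per the ORES extension schema (assumed here).
while :; do
  rows=$(mysql -h db1083.eqiad.wmnet enwiki -N -e \
    "DELETE FROM ores_classification WHERE oc_model = 42 LIMIT 10000; SELECT ROW_COUNT();")
  [ "$rows" -eq 0 ] && break
  sleep 2   # give the s1 replicas room to catch up between batches
done
```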
[10:52:58] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2299.84 seconds [10:52:58] PROBLEM - MariaDB Slave Lag: s6 on db1050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2303.48 seconds [10:52:58] PROBLEM - MariaDB Slave Lag: s6 on db1030 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2304.41 seconds [10:53:08] PROBLEM - MariaDB Slave Lag: s6 on db1037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2308.48 seconds [10:53:08] PROBLEM - MariaDB Slave Lag: s6 on db1023 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2308.54 seconds [10:53:59] more downtimes being lost [10:57:13] (03CR) 10Volans: [C: 032] Traffic: add automatic verification of the changes [switchdc] - 10https://gerrit.wikimedia.org/r/349879 (https://phabricator.wikimedia.org/T163373) (owner: 10Volans) [11:06:31] (03PS3) 10Volans: DNS: add removal of confd stale files [switchdc] - 10https://gerrit.wikimedia.org/r/349880 (https://phabricator.wikimedia.org/T163376) [11:10:49] (03CR) 10Volans: [C: 032] DNS: add removal of confd stale files [switchdc] - 10https://gerrit.wikimedia.org/r/349880 (https://phabricator.wikimedia.org/T163376) (owner: 10Volans) [11:13:43] (03PS2) 10Volans: MediaWiki: reduce verbosity of the cache warmup [puppet] - 10https://gerrit.wikimedia.org/r/349787 (https://phabricator.wikimedia.org/T163369) [11:13:57] 06Operations: Ops pwstore currently read-only - https://phabricator.wikimedia.org/T163942#3217159 (10MoritzMuehlenhoff) 05Open>03Resolved After a "pws update-keyring" adding a new entry to pwstore now works fine again. [11:16:55] (03CR) 10Volans: [C: 032] MediaWiki: reduce verbosity of the cache warmup [puppet] - 10https://gerrit.wikimedia.org/r/349787 (https://phabricator.wikimedia.org/T163369) (owner: 10Volans) [11:26:18] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:26:18] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:26:18] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:26:28] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:26:28] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:26:28] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:26:29] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:26:29] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:26:29] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via 
native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:26:29] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:26:30] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:26:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:30:11] <_joe_> someone removed a graph that was being tested I guess [11:30:17] <_joe_> let me check [11:32:58] <_joe_> yes, http://www.pbs.org/newshour/making-sense/care-peoples-kids/ returns 404 now [11:33:29] <_joe_> if no one manages to, I'll open a ticket when I'm back [11:33:44] I can do that [11:36:28] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [11:39:26] https://phabricator.wikimedia.org/T163986 [11:39:28] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:41:18] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [11:41:18] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [11:41:28] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [11:41:28] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [11:41:28] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [11:41:28] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [11:41:28] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [11:42:28] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [11:42:29] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [11:42:29] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [11:42:29] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [11:43:18] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [11:44:28] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:44:28] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:44:28] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:44:28] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:44:28] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:45:18] PROBLEM - 
citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:45:18] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) [11:46:18] (03PS3) 10Alexandros Kosiaris: changeprop: Remove the ores_uri parameter [puppet] - 10https://gerrit.wikimedia.org/r/345827 (https://phabricator.wikimedia.org/T159615) [11:46:24] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] changeprop: Remove the ores_uri parameter [puppet] - 10https://gerrit.wikimedia.org/r/345827 (https://phabricator.wikimedia.org/T159615) (owner: 10Alexandros Kosiaris) [11:46:26] (03CR) 10jerkins-bot: [V: 04-1] changeprop: Remove the ores_uri parameter [puppet] - 10https://gerrit.wikimedia.org/r/345827 (https://phabricator.wikimedia.org/T159615) (owner: 10Alexandros Kosiaris) [11:47:28] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [11:47:42] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "Indeed. thanks!." [puppet] - 10https://gerrit.wikimedia.org/r/345827 (https://phabricator.wikimedia.org/T159615) (owner: 10Alexandros Kosiaris) [11:48:28] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [11:50:00] !log Upgrade mariadb from 10.0.22 to 10.0.28 on es1015 [11:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:18] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [11:50:18] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [11:50:28] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [11:50:28] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [11:50:28] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [11:59:58] RECOVERY - MariaDB Slave Lag: s6 on db1030 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [12:05:23] PROBLEM - MariaDB Slave IO: es3 on es2018 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es1014.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on es1014.eqiad.wmnet (111 Connection refused) [12:05:40] ^ expected and I silenced it before [12:05:46] ack [12:06:23] RECOVERY - MariaDB Slave IO: es3 on es2018 is OK: OK slave_io_state Slave_IO_Running: Yes [12:06:45] !log Upgrade es1011 and es1014 from mariadb 10.0.22 to mariadb 10.0.28 [12:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:08] RECOVERY - MariaDB Slave Lag: s6 on db1023 is OK: OK slave_sql_lag Replication lag: 0.32 seconds [12:10:47] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3041135 (10Marostegui) As we spoke on IRC, there is one use case that might be useful for the DBAs. When we release packages, that doesn't mean that we are going to upgrade to them straightaway, in fact we mig... 
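The citoid flapping earlier in this stretch is the health-check spec exercising a hard-coded external URL (the pbs.org article _joe_ found returning 404, filed as T163986); citoid itself was healthy. The check can be reproduced by hand, roughly as below, with the internal service port and query shape being assumptions:

```
# Ask citoid to scrape the test page; a 404 from the upstream site
# surfaces as a failed "open graph via native scraper" check.
curl -s -w '\nHTTP %{http_code}\n' \
  "http://citoid.svc.eqiad.wmnet:1970/api?format=mediawiki&search=http%3A%2F%2Fwww.pbs.org%2Fnewshour%2Fmaking-sense%2Fcare-peoples-kids%2F"
```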
[12:11:48] (03PS1) 10Elukey: Avoid unnecessary runs for zk-init [puppet/cdh] - 10https://gerrit.wikimedia.org/r/350542 [12:11:58] RECOVERY - MariaDB Slave Lag: s6 on db1050 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:13:08] RECOVERY - MariaDB Slave Lag: s6 on db1037 is OK: OK slave_sql_lag Replication lag: 0.41 seconds [12:20:39] (03CR) 10Addshore: WMDE Spring campaign - Remove logging (no longer needed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347818 (owner: 10Addshore) [12:20:43] (03CR) 10Addshore: wmgUseGettingStarted true for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347820 (owner: 10Addshore) [12:22:14] (03PS2) 10Faidon Liambotis: Rename ipaddress_primary to ipaddress (same for 6) [puppet] - 10https://gerrit.wikimedia.org/r/350254 (https://phabricator.wikimedia.org/T163196) [12:23:27] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/350254 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [12:23:44] (03CR) 10Faidon Liambotis: [C: 032] Rename ipaddress_primary to ipaddress (same for 6) [puppet] - 10https://gerrit.wikimedia.org/r/350254 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [12:24:44] this is changing the definitions of $ipaddress and $ipaddress6 across the fleet [12:24:54] * volans standing by [12:25:01] in some cases the values as well, hopefully only when wrong [12:26:52] puppet runs will be noisy with /etc/ssh/ssh_known_hosts diffs over the next hour [12:27:08] e.g. [12:27:09] -lvs2002.codfw.wmnet,lvs2002,10.192.1.2,2620:0:860:ed1a::3:d ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBBQO08i0W+o9ARPrk9we9kXuQDdApwX94loKS/2CwlXJJCDbH2SVDvJBxQxJs//DBILSR2ZBf2gkLCMOGdz6/IE= [12:27:13] +lvs2002.codfw.wmnet,lvs2002,10.192.1.2,2620:0:860:101:10:192:1:2 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBBQO08i0W+o9ARPrk9we9kXuQDdApwX94loKS/2CwlXJJCDbH2SVDvJBxQxJs//DBILSR2ZBf2gkLCMOGdz6/IE= [12:27:17] which is actually a fix :) [12:27:39] but up to 30 mins for all hosts to have caught up with the fact change, then another 30 mins for all hosts to get the new ssh_known_hosts [12:31:38] (03PS5) 10Faidon Liambotis: Replace $::main_ipaddress by the new ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) [12:34:35] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3217281 (10BBlack) Some general updates on bin-sizing estimates: Based on the graph data for available bytes in each bin and comparing how fast they initially re... [12:38:18] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:39:19] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3217291 (10Marostegui) That is fine by us, but then we probably want to go ahead and fix this: T159266 [12:41:10] * volans looking ^^^ [12:41:28] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: $resources[0] is :undef, not a hash or array at /etc/puppet/modules/profile/manifests/redis/slave.pp:10 on node rdb1008.eqiad.wmnet [12:41:32] paravoid: ^^^ [12:41:51] is that me? 
[12:41:58] not sure, I'm checking the code [12:42:07] sounds like _joe_ [12:42:30] $password = $resources[0]['parameters']['settings']['requirepass'] [12:42:35] yeah I would say unrelated [12:43:38] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:43:38] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:43:47] _joe_: ^^ [12:44:30] <_joe_> I didn't touch redis today [12:45:12] <_joe_> let me look anyways [12:45:15] mmmh we're querying puppetdb there [12:45:27] $resources = query_resources("fqdn='${master}'", 'Redis::Instance', false) [12:45:38] <_joe_> yes [12:47:58] what changed that could affect this? timing is suspicious with our change [12:48:37] 06Operations, 10ops-eqiad: rack and cable frlog1001 - https://phabricator.wikimedia.org/T163127#3217348 (10Jgreen) @Cmjohnson we migrated off of db1025 and powered it down, so that can be unracked to make space for frlog1001. [12:48:39] <_joe_> uhm that rename maybe? [12:49:06] we've overridden ipaddress and ipaddress6 in facter [12:49:16] with the _primary version that was unused [12:51:08] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:51:44] <_joe_> in fact puppetdb has no Redis::Instance declared on rdb2001 [12:51:48] <_joe_> wtf [12:52:09] I'm running puppet on 2001 [12:52:14] completed successfully [12:52:28] <_joe_> volans: look at the logs [12:52:30] <_joe_> omg [12:52:39] <_joe_> volans: can you disable puppet on all rdbs? [12:52:40] <_joe_> please [12:52:42] sure [12:53:12] <_joe_> ok puppet is broken on the RDBs [12:53:23] <_joe_> ah! [12:53:29] !log disabled puppet on rdb* [12:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:58] win 58 [12:54:51] <_joe_> modules/profile/manifests/redis/multidc.pp: $ip = $facts['ipaddress_primary'] [12:55:08] <_joe_> what should I use instead? ipaddress? [12:55:12] yes [12:55:59] and it's the only place where it's used.. so should be it [12:56:18] (03PS1) 10Giuseppe Lavagetto: profile::redis::multidc: use ipaddress, not ipaddress_primary [puppet] - 10https://gerrit.wikimedia.org/r/350548 [12:57:14] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/350548 (owner: 10Giuseppe Lavagetto) [12:57:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [12:57:33] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::redis::multidc: use ipaddress, not ipaddress_primary [puppet] - 10https://gerrit.wikimedia.org/r/350548 (owner: 10Giuseppe Lavagetto) [12:57:34] checking fatals [12:58:27] People can't login, I guess you know, any information? [12:58:42] no, I don't know [12:58:44] _joe_ what's the status of redis? [12:58:46] let's merge this [12:59:00] I just connected [12:59:32] * zhuyifei1999_ can't edit "Sorry! We could not process your edit due to a loss of session data. " forever...
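The failure mode being uncovered here: after the fact rename, query_resources() legitimately returned an empty result for the masters, and indexing it produced :undef rather than a compile error, so the masters compiled "successfully" with zero redis instances while the slaves got broken catalogs. A guard of the kind volans suggests further down ("validate ... and fail if it's empty") would have aborted the compile instead; a minimal sketch reusing the names from the snippet quoted above:

```
# Hedged sketch: abort compilation rather than configure zero redis instances.
puppet apply -e '
  $resources = []    # stand-in for an empty query_resources() result
  if $resources == [] {
    fail("query_resources returned no Redis::Instance for the master; aborting")
  }
  $password = $resources[0]["parameters"]["settings"]["requirepass"]
'
```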
[12:59:43] <_joe_> volans: I'm running puppet on them [12:59:47] ok [13:00:00] fyi oauth isn't working for multiple users, keep getting E4 myself (reported in #wikimedia-tech) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170427T1300). Please do the needful. [13:00:04] Amir1 and jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170427T1300). Please do the needful. [13:00:15] oh someone used ipaddress_primary? [13:00:16] haha [13:00:19] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [13:00:26] sorry I didn't even check as it was so recent [13:00:30] o/ [13:00:31] o/ [13:00:32] <_joe_> yes, I think the reason is what just happened [13:00:39] <_joe_> please do not SWAT now [13:00:40] please hold the swat for a bit [13:00:44] <_joe_> there is an outage going on [13:00:50] did this actually cause an outage? [13:01:12] yes, see reports [13:01:14] We got report on fr.wiktionary and fr.wikipedia about session issues / can't login issues [13:01:16] I can login now [13:01:17] <_joe_> ferm was obliterated [13:01:17] o/ [13:01:19] goddamn [13:01:25] en.wiki is affected as well Dereckson [13:01:26] <_joe_> so redis was unable to be reached [13:01:28] can you check again, Dereckson, jem [13:01:30] <_joe_> I'm working on it [13:01:35] <_joe_> jynus: not now [13:01:39] oh [13:01:40] sorry [13:01:44] Amir1, jan_drewniak: do you want to deploy your own changes? [13:01:51] (not sure if you are deployers) [13:01:57] zeljkof: hold the SWAT please, outage ongoing [13:02:00] I can deploy mine [13:02:08] jynus: I could login 30 seconds ago, after two "please go back and retry" [13:02:09] jynus: yup works for user reporting the issue [13:02:11] volans: ok, swat on hold [13:02:14] After the outage of course [13:02:23] Amir1, jan_drewniak: swat on hold, outage in progress [13:02:23] Dereckson, jem we are not 100% ok [13:02:35] Ok [13:02:36] zeljkof: I'm not a 'deployer' :/ but my patch is pretty easy [13:02:51] jan_drewniak: later [13:03:02] <_joe_> ok it should be ok now [13:03:07] volans: thanks, did not notice, please ping me when we can start the swat [13:03:16] <_joe_> even a minute ago or so [13:03:22] <_joe_> two, even [13:03:23] I can still login [13:03:32] Login worked for me just now [13:03:36] on enwiki [13:03:43] <_joe_> jynus: it's resolved, it was just the firewall blocking connections to redis [13:03:45] an edit [13:03:48] *and [13:04:02] sorry, _joe_ it worked before [13:04:09] it's okay now [13:04:11] thanks guys [13:04:11] (03PS1) 10Elukey: Swap mc1001->mc1012 with mc1019->mc2030 [puppet] - 10https://gerrit.wikimedia.org/r/350549 (https://phabricator.wikimedia.org/T137345) [13:04:14] so I assumed it had been solved then [13:04:18] I am not merging now [13:04:21] I just connected [13:04:25] Amir1: do you want to deploy both changes? (after the outage) or should I deploy the second one?
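As the resolution above shows, the outage amounted to ferm silently closing the redis ports on the masters, which a direct network-level probe would have caught before the MediaWiki fatals paged; paravoid makes the same point further down ("checking if the port is open would have been enough"). A hedged sketch of such a check, with hostname, port, and password handling all assumed (rdb1001:6380 is borrowed from the earlier !log entry):

```
# Probe a session redis from another host: port open, then an authenticated PING.
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/rdb1001.eqiad.wmnet/6380' \
  || echo "CRITICAL: redis port closed"
redis-cli -h rdb1001.eqiad.wmnet -p 6380 -a "$REDIS_PASS" ping | grep -qx PONG \
  || echo "CRITICAL: redis not answering PING"
```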
[13:04:28] and didn't have all the context [13:04:53] let me see editing/connection/errors impact [13:04:57] <_joe_> jynus: I think the outage was progressively getting worse, that ended when the last session redis ran puppet [13:05:19] <_joe_> I don't dare look at the jobqueue now [13:05:30] and yet no page! [13:05:30] zeljkof: let me see [13:05:33] so let's fix that [13:05:52] 12:34-13:00 errors [13:05:52] that seems easy [13:06:05] so I assume that is the windows of "potential problems" [13:06:09] *window [13:06:24] jan_drewniak: just something like cd /srv/mediawiki-staging/ sync-portals [13:06:25] :43 is when the first puppet run failed [13:06:27] I will check login failures specifically [13:06:36] ah no, :38 [13:06:56] <_joe_> paravoid: no page because puppet removed all redis instances logically from the machines, so it removed the nrpe checks too [13:07:04] this one looks good https://grafana.wikimedia.org/dashboard/db/edit-count?refresh=5m&panelId=13&fullscreen&orgId=1&from=now-6h&to=now [13:07:07] <_joe_> but yes, a higher level page [13:07:12] and that's why we usually have higher-level service checks :) [13:07:14] to see edit impact [13:07:15] yeah [13:07:15] <_joe_> would've been nicer [13:07:20] elukey, thanks [13:07:47] can I get rights to change the topic here? [13:08:07] I am normally useless, but I can at least do that when something is ongoing :-) [13:08:41] so how did this escalate from one fact failing to puppet killing the service? :) [13:08:48] Amir1: yeah something like that, but I think you have to specify the staging or production server [13:09:30] there are so many levels of indirections I can't even figure out where $ip is being used [13:09:32] jynus: _joe_ oauth is okay now I can now login properly thanks for fixing that so quickly [13:09:49] _joe_ fixed it, he saved the day [13:09:56] maybe volans helped? [13:10:22] jynus: I'm usually as useless as you :-P [13:10:30] Amir1: it basically just syncs the files https://github.com/wikimedia/portals/blob/master/sync-portals [13:10:40] only *you* can make yourself useless [13:10:49] <_joe_> paravoid: not killing the service, just making ferm close all the ports [13:10:51] if instance['host'] == @ip && instance['port'] == @title.to_i [13:10:53] there are still some edit failures [13:11:02] jan_drewniak: ah I see [13:11:11] _joe_: where? [13:11:12] not sure if monitoring lag or people got unauthenticated [13:11:19] _joe_: and you also mentioned nrpe getting removed?
[13:11:38] Amir1: and I think it's run from the root of the repo, which is in mediawiki-config/portals [13:11:39] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:11:39] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [13:11:41] <_joe_> paravoid: yes, sorry, remembering all of that now [13:12:04] <_joe_> so it extracts which instances we want from redis::shards which is a common data structure [13:12:13] so this was unfortunate and it's definitely my bad for not checking ipaddress_primary beforehand [13:12:55] but I'm interested to find out at least those two things: a) why didn't we get paged (= we need a service check) b) why did an innocuous catalog failure resulted in a full-blown service outage [13:13:00] find out and fix I mean :) [13:13:09] please tell me once the outage is over [13:13:11] <_joe_> it's not a catalog failure alas [13:13:18] <_joe_> the catalog succeeded on the masters [13:13:26] <_joe_> what we saw here was a fallout on the slaves [13:13:40] so the catalog failures were unrelated? [13:13:44] because we did have those too [13:13:45] <_joe_> puppet assumed exactly zero instances of redis needed to be configured on that machine [13:13:48] where on the slaves [13:13:56] <_joe_> those were on local slaves within the same DC [13:14:07] <_joe_> because they lost the information on the masters [13:14:18] so a change disabled the service and the monitoring? [13:14:22] <_joe_> actually that was fortunate in this case as it pointed me in the right direction [13:14:29] ok, I'm getting lost, I'll start an incident report [13:14:46] <_joe_> paravoid: I can help ofc [13:15:37] ahhh so we saw alarms for puppet failing on slaves due to no masters available, when puppet already had knocked down ferm on the masters [13:15:46] <_joe_> yes [13:16:08] <_joe_> we would've shortly seen redis replication alerts too from the slaves [13:16:11] unfortunately yes [13:16:32] _joe_ not super shortly since there are some retries [13:16:44] <_joe_> the issue being here, we use the same data structure to populate nutcracker and to configure redis [13:16:47] thing that now we might want to revert (that is my bad) [13:17:09] <_joe_> which is of course good so we stopped screwing up one or the other [13:17:18] <_joe_> but it added some indirection to the puppet code [13:17:53] <_joe_> the other issue is me using the fancy new ipaddress_primary as soon as it was out of the box :/ [13:19:16] well can't blame you for that [13:19:19] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [13:19:43] I am impressed about how quickly this was resolved, it could have been much worse [13:19:51] <_joe_> I'm adding a check-graphite check for session_loss at the very least [13:20:16] ther is already that [13:20:23] oh, a check [13:20:25] and me failing to check if it was actually used in code review (I was too concentrated on testing the next related CR probably) [13:20:25] sorry [13:25:01] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bawiktionary.tag_summary doesnt exist on query. Default database: bawiktionary. 
[Query snipped] [13:26:03] so, given that $facts['inexistent_fact'] didn't return an error, probably we should validate where we use it and fail if it's empty, as a general rule, probably making a helper function that does that [13:26:11] you can ignore anything replication related on eqiad this week (we cannot downtime all of them, because we need to fix them anyway) [13:28:14] I will fix that error [13:28:17] and the next ones to come [13:28:19] no [13:28:20] zeljkof, Amir1 and jan_drewniak: SWAT can go on now. Thanks for waiting [13:28:21] I am on it [13:28:23] ah [13:28:24] ok [13:28:25] it is s3 [13:28:30] volans: thanks! [13:28:31] there will be quite a lot :( [13:28:32] my problem [13:28:34] I know [13:28:43] okay, thanks [13:28:53] tag_summary is the alter I did, so it is my problem technically :) [13:28:55] Amir1: are you deploying your commit, or both? [13:29:00] oh [13:29:03] both if that's okay [13:29:09] ok, then [13:29:11] Amir1: sure, go ahead [13:29:26] jynus: I will take care of it [13:29:53] just let's coordinate so we do not step on each other's toes [13:30:05] _joe_, volans: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170427-redis-ipaddress [13:30:25] jynus: sure [13:30:31] we should add a filter on those non-existent databases [13:30:33] I will fix the tag_summary and change_tag ones [13:30:35] like on labs [13:30:35] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349983 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:30:37] _joe_, volans: feel free to amend as needed, I'm definitely missing the "alerts were for the slaves, but the real failure was on the masters which didn't even alert" [13:30:48] ok [13:31:07] <_joe_> paravoid: ok [13:31:52] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349983 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:32:01] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [13:32:05] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349983 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:34:30] jan_drewniak: It's live in mwdebug1002 [13:35:01] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:36:55] (03Abandoned) 10Elukey: [WIP] Replace mc1001 with mc1019 [puppet] - 10https://gerrit.wikimedia.org/r/336972 (owner: 10Elukey) [13:37:06] Amir1: Hmm, mwdebug1002 doesn't look any different to me, let me make sure the latest patch is actually in the repo [13:37:19] (03CR) 10Elukey: "Pcc looks good: https://puppet-compiler.wmflabs.org/6245/" [puppet] - 10https://gerrit.wikimedia.org/r/350549 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [13:37:43] jan_drewniak: do I need to run the script even for mwdebug too? [13:37:58] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T163690#3217466 (10fgiunchedi) [13:38:27] 06Operations, 06Labs, 10wikitech.wikimedia.org: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3217467 (10Andrew) I updated the password so that's all good now. [13:40:11] Amir1: maybe?
[13:40:48] From what I can see in the bash file it seems it's for full deployment [13:40:49] * Amir1 https://github.com/wikimedia/portals/blob/master/sync-portals [13:41:28] Amir1: is the submodule on the latest commit? [13:41:54] wait a sec [13:42:04] jan_drewniak: please test again [13:42:21] Amir1: yay! [13:42:22] sorry, forgot to do git submodule update [13:42:58] jan_drewniak: so that means "go live everywhere"? [13:43:03] (03PS1) 10Giuseppe Lavagetto: graphite::alerts: add alerting on session loss [puppet] - 10https://gerrit.wikimedia.org/r/350555 [13:43:07] Amir1: yup [13:43:22] <_joe_> volans: ^^ [13:43:33] <_joe_> elukey too [13:43:37] looking [13:43:45] !log ladsgroup@naos:/srv/mediawiki-staging$ portals/sync-portals (T128546) [13:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:53] T128546: [Recurring Task] Update Wikipedia.org Portal and sister Wiki's statistics - https://phabricator.wikimedia.org/T128546 [13:44:32] (03CR) 10jerkins-bot: [V: 04-1] graphite::alerts: add alerting on session loss [puppet] - 10https://gerrit.wikimedia.org/r/350555 (owner: 10Giuseppe Lavagetto) [13:44:42] !log ladsgroup@naos Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 01m 21s) [13:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:46] <_joe_> meh how my local linter didn't get that [13:45:46] (03PS2) 10Giuseppe Lavagetto: graphite::alerts: add alerting on session loss [puppet] - 10https://gerrit.wikimedia.org/r/350555 [13:45:47] !log ladsgroup@naos Synchronized portals: (no justification provided) (duration: 01m 05s) [13:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:06] Done [13:46:12] check please jan_drewniak [13:46:40] Amir1: looks good, thanks! [13:46:48] _joe_: <3 [13:47:05] no, to mine [13:47:08] *now [13:47:44] (03CR) 10Ladsgroup: [C: 032] Set echoIcon for notification of wikibase in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350418 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [13:47:48] (03PS1) 10Filippo Giunchedi: grafana: break down HTTP 499 in swift [puppet] - 10https://gerrit.wikimedia.org/r/350556 [13:47:55] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350418 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [13:48:27] <_joe_> paravoid: I'm quite embarassed we don't have more alerts, but tbh mediawiki reporting to graphite is spotty [13:48:43] (03Merged) 10jenkins-bot: Set echoIcon for notification of wikibase in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350418 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [13:48:52] (03CR) 10jenkins-bot: Set echoIcon for notification of wikibase in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350418 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [13:49:08] _joe_: can't we do the same (or an additional) check without graphite? 
[13:49:22] just issuing a redis command to the right redis servers [13:49:32] in this case even checking if the port is open would have been enough [13:49:49] <_joe_> paravoid: yes, that's on me for making the checks nrpe checks [13:51:43] <_joe_> that honestly is the best option to check for replication etc [13:52:04] <_joe_> tbh, we got paged by MediaWiki fatals when the issue got big [13:52:10] !log start of scap sync-file wmf-config/Wikibase-production.php 'SWAT: Set echoIcon for notification of wikibase in test wikis (T142102)' [13:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:19] T142102: [Story] Deploy Wikibase notifications to Wikimedia projects - https://phabricator.wikimedia.org/T142102 [13:52:39] _joe_ wouldn't it be better to use .session_loss.rate? [13:52:57] !log ladsgroup@naos Synchronized wmf-config/Wikibase-production.php: SWAT: Set echoIcon for notification of wikibase in test wikis (T142102) (duration: 00m 57s) [13:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:14] <_joe_> elukey: yes of course, brainfart [13:53:52] rest looks good! We might tune the alarms in the future but LGTM [13:53:54] (03PS3) 10Giuseppe Lavagetto: graphite::alerts: add alerting on session loss [puppet] - 10https://gerrit.wikimedia.org/r/350555 [13:55:31] PROBLEM - Nginx local proxy to apache on mw2213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:31] PROBLEM - HHVM rendering on mw2213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:41] PROBLEM - Apache HTTP on mw2213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:57] <_joe_> uhm [13:55:58] <_joe_> checking [13:56:03] <_joe_> what a fun afternoon! [13:57:13] <_joe_> !log restarting HHVM on mw2213, stuck in HPHP::Treadmill::getAgeOldestRequest [13:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:01] * paravoid wonders if $slaveof = ipresolve($master, 4) has the same potential of causing an outage [13:58:40] <_joe_> no [13:58:57] <_joe_> ipresolve will make the puppet catalog fail if it fails [13:59:20] hm, yeah, good point [13:59:21] RECOVERY - Nginx local proxy to apache on mw2213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.029 second response time [13:59:21] RECOVERY - HHVM rendering on mw2213 is OK: HTTP OK: HTTP/1.1 200 OK - 73275 bytes in 0.101 second response time [13:59:31] RECOVERY - Apache HTTP on mw2213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.021 second response time [13:59:54] <_joe_> I have to admit I didn't expect [13:59:56] <_joe_> puppet apply -e 'notice($facts["pinkunicorn"])' [14:00:01] <_joe_> to just work [14:00:08] why not [14:00:27] even notice($::pinkunicorn) would work right? 
[14:00:34] <_joe_> no that wouldn't [14:00:35] it'd warn, but still compile/work [14:00:56] (03PS9) 10Elukey: Refactor role::piwik in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) [14:00:58] <_joe_> well not in puppet 3, right :) [14:01:08] <_joe_> in puppet 4 that's a fatal error [14:01:21] <_joe_> if you configure it to [14:02:02] faidon@d-i-test:~$ puppet apply -e 'notice("foo is ${::pinkunicorn}")' [14:02:05] Warning: Scope(Class[main]): Could not look up qualified variable '::pinkunicorn'; [14:02:08] Notice: Scope(Class[main]): foo is [14:02:11] Notice: Compiled catalog for d-i-test.eqiad.wmnet in environment production in 0.01 seconds [14:02:14] Notice: Finished catalog run in 0.02 seconds [14:04:01] <_joe_> if that's puppet 4, try setting strict_variables = true [14:04:16] <_joe_> yes, strict variables are a thing in puppet [14:04:16] faidon@d-i-test:~$ puppet apply -e 'notify { 'foo': message => "foo is ${::pinkunicorn}" }' [14:04:18] I've added some info to the incident report timeline, conclusions and actionables, please review/modify them [14:04:19] Warning: Scope(Class[main]): Could not look up qualified variable '::pinkunicorn'; [14:04:22] Notice: Compiled catalog for d-i-test.eqiad.wmnet in environment production in 0.02 seconds [14:04:25] Notice: foo is [14:04:28] Notice: /Stage[main]/Main/Notify[foo]/message: defined 'message' as 'foo is ' [14:04:31] Notice: Finished catalog run in 0.02 seconds [14:04:33] that's puppet 3 [14:04:46] !log Deploy alter table labswiki.revision on silver - T132416 [14:04:48] we're pretty far off puppet 4 + strict_variables [14:04:50] <_joe_> yes, I know, in puppet 3 it's just a warning [14:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:54] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [14:05:00] ah ok, I misread you above [14:05:15] <_joe_> yeah I tried to clarify but not very well :P [14:05:21] 17:00 < paravoid> it'd warn, but still compile/work [14:05:23] 17:00 < _joe_> well not in puppet 3, right :) [14:05:41] <_joe_> yeah I meant it won't fail in puppet 3 [14:05:47] yeah got you now [14:06:12] <_joe_> it's the damn lag between when you start writing something and someone concludes their sentence [14:06:25] <_joe_> that creates context failures like that [14:06:33] 06Operations, 10Monitoring: Icinga check for ipv6 host reachability - https://phabricator.wikimedia.org/T163996#3217578 (10fgiunchedi) [14:08:22] !log Deploy alter table labswiki.revision on labtestweb2001 - T132416 [14:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:20] _joe_: sigh that session loss alert reminded me about another patch and task we've never merged, https://gerrit.wikimedia.org/r/#/c/256422 I'll comment on yours [14:15:48] (03PS1) 10Ema: Release 4.1.5-1wm4 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/350561 (https://phabricator.wikimedia.org/T145661) [14:16:43] (03CR) 10Filippo Giunchedi: "This would supersede https://gerrit.wikimedia.org/r/#/c/25642 and related T108985, though 'MediaWiki.edit.failures.bad_token.count' should" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/350555 (owner: 10Giuseppe Lavagetto) [14:18:53] (03CR) 10BBlack: [C: 031] Release 4.1.5-1wm4 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/350561 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [14:20:59] (03CR) 
10Ema: [V: 032 C: 032] Release 4.1.5-1wm4 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/350561 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [14:22:25] 06Operations, 10ops-eqiad, 15User-fgiunchedi: HP RAID icinga alert on ms-be1021 - https://phabricator.wikimedia.org/T163777#3217642 (10fgiunchedi) Looks like now battery count is reported as zero and `Cache Status: Permanently Disabled` plus `Cache Status Details: Cable Error` are still active, though the hp... [14:25:59] (03CR) 10Elukey: [C: 032] Refactor role::piwik in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) (owner: 10Elukey) [14:26:35] !log varnish 4.1.5-1wm4 uploaded to apt.w.o T145661 [14:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:44] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661 [14:28:37] marostegui: jynus: hello just got a call from the Dell tech he will be onsite for main board and memory replacement on es2019 anytime between now and 1pm please take es2019 down for me thanks [14:28:41] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1008 is CRITICAL: connect to address 208.80.154.42 and port 3128: Connection refused [14:28:52] papaul: will do thank you [14:29:00] marostegui: thanks [14:29:11] PROBLEM - puppet last run on bohrium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:29:54] !log Stop MySQL and shutdown es2019 for HW replacement - T149526 [14:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:03] T149526: es2019 crashed again - https://phabricator.wikimedia.org/T149526 [14:30:36] bohrium is due to me, sigh [14:31:09] papaul: the host is now down [14:31:12] uh, looking into cp1008 [14:31:46] (03PS4) 10Volans: Puppet compiler: sync newest facts only [puppet] - 10https://gerrit.wikimedia.org/r/335670 (https://phabricator.wikimedia.org/T157052) [14:31:52] oh, I've upgraded varnish there without running the puppet agent, fixing [14:33:41] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 176 bytes in 0.072 second response time [14:34:49] !log upgrade upload-codfw to varnish 4.1.5-1wm4 T145661 [14:34:56] marostegui: thanks [14:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:58] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661 [14:36:22] (03PS2) 10Eevans: Blacklist `type=Table` metrics [puppet] - 10https://gerrit.wikimedia.org/r/350485 (https://phabricator.wikimedia.org/T163936) [14:36:41] (03CR) 10Giuseppe Lavagetto: [C: 031] Puppet compiler: sync newest facts only [puppet] - 10https://gerrit.wikimedia.org/r/335670 (https://phabricator.wikimedia.org/T157052) (owner: 10Volans) [14:37:55] 06Operations, 10Monitoring: check_hpssacli should report on battery failures and cache disabled - https://phabricator.wikimedia.org/T163998#3217675 (10fgiunchedi) [14:39:18] (03CR) 10Filippo Giunchedi: [C: 031] Update collector version to 3.1.4 [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/350503 (https://phabricator.wikimedia.org/T163936) (owner: 10Eevans) [14:43:11] RECOVERY - puppet last run on bohrium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:43:54] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: pdu phase inbalances:
ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3217714 (10Papaul) p:05High>03Normal [14:47:30] (03PS1) 10Elukey: Add quotes to the Piwik configuration file [puppet] - 10https://gerrit.wikimedia.org/r/350567 (https://phabricator.wikimedia.org/T159136) [14:48:33] (03PS2) 10Filippo Giunchedi: grafana: break down HTTP 499 in swift [puppet] - 10https://gerrit.wikimedia.org/r/350556 [14:50:13] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service: Re-enable ORES data in action API - https://phabricator.wikimedia.org/T163687#3217755 (10Halfak) [14:51:55] (03PS1) 10Faidon Liambotis: redis: make redis_get_instances() fail when empty [puppet] - 10https://gerrit.wikimedia.org/r/350568 [14:51:58] (03CR) 10Elukey: [C: 032] Add quotes to the Piwik configuration file [puppet] - 10https://gerrit.wikimedia.org/r/350567 (https://phabricator.wikimedia.org/T159136) (owner: 10Elukey) [14:53:08] (03CR) 10Giuseppe Lavagetto: [C: 031] redis: make redis_get_instances() fail when empty [puppet] - 10https://gerrit.wikimedia.org/r/350568 (owner: 10Faidon Liambotis) [14:54:11] (03CR) 10Paladox: [C: 04-1] "> We should set this to HTTP, not LDAP. I want to keep using the" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [14:54:23] (03PS3) 10Paladox: Gerrit: Set gitBasicAuthPolicy = LDAP [puppet] - 10https://gerrit.wikimedia.org/r/350484 [14:55:15] (03CR) 10Paladox: "If we do HTTP_LDAP, this can be merged for 2.13, but we first need to upgrade as one of the 2.13.x releases added support for this but was" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [14:56:57] (03CR) 10Paladox: "> can you add some more details what this does / a link to some docs" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [15:04:06] (03PS1) 10Elukey: Fix Piwik erb config template [puppet] - 10https://gerrit.wikimedia.org/r/350573 (https://phabricator.wikimedia.org/T159136) [15:05:59] 06Operations, 06Labs: Update documentation for Tools Proxy failover - https://phabricator.wikimedia.org/T163390#3217866 (10Andrew) Looks good to me. I added clarification about short-term failovers since that's the most frequent use case: https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource%3ATools... [15:06:25] 06Operations, 06Labs: Update documentation for Tools Proxy failover - https://phabricator.wikimedia.org/T163390#3217867 (10Andrew) a:05Andrew>03chasemp [15:08:06] akosiaris: _joe_ : just to confirm, scb2005 and scb2006 are up and running, right? [15:08:21] (03CR) 10Elukey: [C: 032] Fix Piwik erb config template [puppet] - 10https://gerrit.wikimedia.org/r/350573 (https://phabricator.wikimedia.org/T159136) (owner: 10Elukey) [15:08:33] it seems that was the reason for yesterday's outage [15:13:10] (03PS1) 10Elukey: Set trusted hosts for Piwik in its profile [puppet] - 10https://gerrit.wikimedia.org/r/350576 (https://phabricator.wikimedia.org/T159136) [15:14:09] 06Operations, 10Ops-Access-Requests, 10Deployment-Systems: Enable keyholder for ORES deployments - https://phabricator.wikimedia.org/T163939#3217925 (10Halfak) Oh! How do I come to know the appropriate passphrase? [15:16:52] <_joe_> Amir1: what exactly is the reason for the outage? [15:17:19] _joe_: our scap config does not include scb2005 and 2006 [15:17:23] (03CR) 10Chad: [C: 031] "No, I don't want to fall back to LDAP. Having a specific password for this is pretty much standard practice -- Phabricator does the same.
" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [15:17:28] <_joe_> Amir1: sigh, that again? [15:17:52] (03CR) 10Paladox: [C: 031] "Ok, then this can be merged when ever :)" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [15:17:52] so when requests comes in and then goes to uwsgi then goes to redis to worker they don't match [15:17:56] <_joe_> who thought it was a good idea to put host lists inside of the scap repos, was, well, wrong [15:18:21] db-eqiad.php ? [15:18:31] :))) It is getting more obvious [15:18:54] <_joe_> Amir1: it was obvious from day 0 to me :P [15:19:04] <_joe_> anyways, I'm off for today, see you all tomorrow [15:19:12] have a good one [15:19:49] (03CR) 10Elukey: [C: 032] Set trusted hosts for Piwik in its profile [puppet] - 10https://gerrit.wikimedia.org/r/350576 (https://phabricator.wikimedia.org/T159136) (owner: 10Elukey) [15:22:11] !log stopping all replication channels on dbstore1001 for topology changes [15:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:10] !log Upgrade db1090 mariadb from 10.0.23 to 10.0.28 [15:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:56] !log Upgrade db1089 mariadb from 10.0.23 to 10.0.28 [15:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:29] (03PS1) 10Ladsgroup: Fix echoIcon for wikibase in testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350584 (https://phabricator.wikimedia.org/T142102) [15:41:42] 06Operations, 06Labs: During labservices1001 failover fqdn changed from foo.project.eqiad.wmflabs to foo.eqiad.wmflabs - https://phabricator.wikimedia.org/T163823#3218042 (10Andrew) When the metaservice is fully down then puppet errors out. If instead the metadata service is responding and returning an empty... [15:41:59] hrm, FWIW there is a task for a canonical target list that reads from a different source apart from an individual scap repo here https://phabricator.wikimedia.org/T148992 [15:43:02] ^ Amir1 may see if the proposal there there would have prevented the ORES problems [15:43:57] Thanks I will look it [15:44:51] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1036 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:44:52] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1043 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:44:52] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1068 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:44:52] PROBLEM - Check systemd state on analytics1068 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:44:52] PROBLEM - Check systemd state on analytics1044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:44:52] PROBLEM - Check systemd state on analytics1067 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:44:52] PROBLEM - Check systemd state on analytics1043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:45:01] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1067 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:45:12] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1035 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:45:12] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1044 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:45:12] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1042 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:45:12] PROBLEM - Check systemd state on analytics1035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:45:12] PROBLEM - Check systemd state on analytics1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:45:12] PROBLEM - Check systemd state on analytics1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:45:12] PROBLEM - Check systemd state on analytics1042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:45:13] PROBLEM - Check systemd state on analytics1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:45:13] PROBLEM - Check systemd state on analytics1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:45:18] elukey: ^^^ [15:45:41] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1037 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:45:41] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:45:41] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1045 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:46:14] gneee [15:46:22] I just removed the downtime, they were all gree [15:46:24] *green [15:46:38] !log Upgrade db1091 mariadb from 10.0.23 to 10.0.28 [15:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:50] elukey: Active: failed (Result: exit-code) since Wed 2017-04-26 16:51:42 UTC; 22h ago [15:47:05] maybe you needed to clear the systemctl status? [15:47:11] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.00 seconds [15:47:12] or it's a failure [15:47:27] checking now [15:47:39] checking db1048 [15:47:44] I mean the 'reset-failed' [15:48:03] Current Cache Policy: WriteThrough [15:49:16] DNS query for 'analytics1003.eqiad.wmnet' failed: query timed out ? 
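For context on the alert storm above: the "Check whether ferm is active" probe asserts that the kernel's INPUT chain still has the DROP default policy that ferm installs. Conceptually it amounts to something like the following sketch; the real NRPE plugin may be implemented differently:

```
#!/bin/bash
# Hedged sketch of the ferm default-policy check seen in the alerts above;
# the production plugin may differ in detail.
if iptables -S INPUT | head -n1 | grep -q -- '-P INPUT DROP'; then
    echo "OK ferm input default policy is set"
    exit 0
fi
echo "ERROR ferm input drop default policy not set, ferm might not have been started correctly"
exit 2
```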
[15:49:41] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1002 is OK: OK ferm input default policy is set [15:49:46] forced ferm and it started correctly [15:49:48] grrr [15:50:11] RECOVERY - Check systemd state on analytics1002 is OK: OK - running: The system is fully operational [15:50:12] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 10.00 seconds [15:50:17] (03PS1) 10Andrew Bogott: openstack: Temporarily disable self service instance creation and deletion on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350587 [15:50:41] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1037 is OK: OK ferm input default policy is set [15:50:42] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1045 is OK: OK ferm input default policy is set [15:50:52] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1043 is OK: OK ferm input default policy is set [15:50:52] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1036 is OK: OK ferm input default policy is set [15:50:52] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1068 is OK: OK ferm input default policy is set [15:50:52] RECOVERY - Check systemd state on analytics1068 is OK: OK - running: The system is fully operational [15:50:52] RECOVERY - Check systemd state on analytics1044 is OK: OK - running: The system is fully operational [15:50:52] RECOVERY - Check systemd state on analytics1067 is OK: OK - running: The system is fully operational [15:50:52] RECOVERY - Check systemd state on analytics1043 is OK: OK - running: The system is fully operational [15:50:55] !log forced 'service ferm start' on the failed analytics hosts [15:50:57] !log disabling labs instance create/delete to avoid hilarity during network maintenance [15:51:01] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1067 is OK: OK ferm input default policy is set [15:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:11] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1044 is OK: OK ferm input default policy is set [15:51:11] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1035 is OK: OK ferm input default policy is set [15:51:11] RECOVERY - Check systemd state on analytics1037 is OK: OK - running: The system is fully operational [15:51:11] RECOVERY - Check systemd state on analytics1036 is OK: OK - running: The system is fully operational [15:51:11] RECOVERY - Check systemd state on analytics1045 is OK: OK - running: The system is fully operational [15:51:30] (03CR) 10Andrew Bogott: [C: 032] openstack: Temporarily disable self service instance creation and deletion on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350587 (owner: 10Andrew Bogott) [15:51:47] elukey: the one I checked failed yesterday though [15:52:16] volans: yes the downtime was up to EOD today, I didn't remove it yesterday (it was 11 PM and I forgot :P) [15:52:35] ah ok [15:52:51] yeah my bad [15:53:15] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218132 (10Marostegui) This has happened again: `˜/icinga-wm 17:47> PROBLEM -
MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.00 seconds`... [15:54:26] (03Abandoned) 10Chad: Scap clean: exclude .git directories on first pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347633 (owner: 10Chad) [15:55:03] and now I think jmxtrans in there might not be working correctly [15:55:11] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1042 is OK: OK ferm input default policy is set [15:55:11] RECOVERY - Check systemd state on analytics1042 is OK: OK - running: The system is fully operational [15:56:16] !log restart of jmxtrans on all the hadoop worker nodes [15:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:29] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218173 (10Marostegui) And it recovered: ``` root@db1048:~# megacli -AdpBbuCmd -a0 BBU status for Adapter: 0 BatteryType: BBU Voltage: 4058 mV Current: 152 mA Temperature: 33 C... [15:57:43] 06Operations, 10Ops-Access-Requests, 10Deployment-Systems: Enable keyholder for ORES deployments - https://phabricator.wikimedia.org/T163939#3218181 (10Dzahn) You should not need the passphrase since the key is already loaded: ``` [naos:~] $ sudo keyholder status keyholder-agent: active .. - 2048 6d:54:92:... [16:00:11] RECOVERY - Check systemd state on analytics1035 is OK: OK - running: The system is fully operational [16:02:43] (03CR) 10Dzahn: "commit message and actual code change are different" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [16:03:03] (03PS4) 10Paladox: Gerrit: Set gitBasicAuthPolicy = HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 [16:03:05] ahhh okok jmxtrans is saying java.nio.channels.UnresolvedAddressException [16:03:11] (03PS5) 10Paladox: Gerrit: Set gitBasicAuthPolicy = HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 [16:03:12] this is the cause of those alerts in icinga [16:07:58] (03PS3) 10Eevans: Blacklist `type=Table` metrics [puppet] - 10https://gerrit.wikimedia.org/r/350485 (https://phabricator.wikimedia.org/T163936) [16:09:54] urandom: gah, sorry missed puppet swat, starting now [16:10:02] godog: no worries [16:10:36] (03PS6) 10Paladox: Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 [16:10:41] (03CR) 10Filippo Giunchedi: [C: 032] Blacklist `type=Table` metrics [puppet] - 10https://gerrit.wikimedia.org/r/350485 (https://phabricator.wikimedia.org/T163936) (owner: 10Eevans) [16:12:27] urandom: merged [16:13:08] godog: yup; thanks! [16:13:44] (03PS7) 10Paladox: Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 [16:14:14] (03PS1) 10Chad: scap clean: Use list to delete, rather than list to keep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350592 [16:15:06] godog, elukey: and it looks fine fwiw [16:15:32] elukey: i didn't apply it to any aqs hosts (i don't have the karma) [16:15:34] Ah crud, I forgot puppetswat was now [16:15:46] Can I sneak one in? [16:15:49] :P [16:15:50] elukey: but it's coming as puppet runs normally :) [16:16:15] RainbowSprinkles: which?
:) [16:16:17] 06Operations, 10ops-codfw: codfw: ganeti2007-ganeti2008 racking and onsite setup tasks - https://phabricator.wikimedia.org/T164011#3218244 (10Papaul) [16:16:39] godog: https://gerrit.wikimedia.org/r/#/c/350246/ [16:16:45] 06Operations, 10ops-codfw: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3218261 (10Papaul) [16:17:06] elukey: oh, and cassandra-metrics-collector.service will need a restart to apply (apparently puppet isn't doing that for changes to the filter config) [16:17:16] sure [16:18:39] RainbowSprinkles: sure [16:18:59] (03PS3) 10Filippo Giunchedi: Planet: Remove Fedora People / planetsun from fellow planet listing [puppet] - 10https://gerrit.wikimedia.org/r/350246 (owner: 10Chad) [16:19:19] godog: Thx, I added it to the deployment calendar too so it's Official [16:19:38] RainbowSprinkles: haha thanks, I was about to ask the same [16:19:48] (03PS8) 10Paladox: Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 [16:19:55] (03PS9) 10Paladox: Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 [16:20:42] * godog shakes fist while waiting for jenkins [16:21:16] 06Operations, 10puppet-compiler: hosts with puppet compiler failures on every run - https://phabricator.wikimedia.org/T162949#3218267 (10Volans) [16:21:33] (03CR) 10Filippo Giunchedi: [C: 032] Planet: Remove Fedora People / planetsun from fellow planet listing [puppet] - 10https://gerrit.wikimedia.org/r/350246 (owner: 10Chad) [16:21:33] 06Operations, 10puppet-compiler: hosts with puppet compiler failures on every run - https://phabricator.wikimedia.org/T162949#3181168 (10Volans) I've added a few more that I saw today in https://puppet-compiler.wmflabs.org/6247/ [16:22:28] RainbowSprinkles: {{done}} nothing else to do I think or cache is involved? [16:22:46] I dunno tbh [16:23:26] (03CR) 10Thcipriani: [C: 031] scap clean: Use list to delete, rather than list to keep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350592 (owner: 10Chad) [16:23:41] RainbowSprinkles: ok I'll check back tomorrow morning, if it still there I'll poke it [16:26:05] sounds good to me [16:30:01] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:31:38] (03PS10) 10Dzahn: Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [16:32:16] !log unbanning elasticsearch servers in eqiad row D - elastic10(17|18|19|20) - T148506 [16:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:55] T148506: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506 [16:34:33] (03CR) 10Dzahn: "I doubt that this is correct: "This changes it from using your ldap password to a random generated password". 
Isn't it the HTTP password w" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [16:34:48] ACKNOWLEDGEMENT - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% andrew bogott switch maintenance [16:36:21] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [16:36:23] (03CR) 10Paladox: [C: 031] "> I doubt that this is correct: "This changes it from using your ldap" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [16:37:21] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [16:42:31] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 36.66 ms [16:42:42] !log demon@naos Pruned MediaWiki: 1.29.0-wmf.19 [keeping static files] (duration: 00m 15s) [16:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:39] (03CR) 10Chad: [C: 032] scap clean: Use list to delete, rather than list to keep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350592 (owner: 10Chad) [16:45:52] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[hadoop-yarn-nodemanager] [16:45:55] !log re-enabling labs instance creation/deletion [16:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:12] checking 1041 [16:46:15] (03PS1) 10Andrew Bogott: Revert "openstack: Temporarily disable self service instance creation and deletion on horizon" [puppet] - 10https://gerrit.wikimedia.org/r/350603 [16:46:21] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 19 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[hadoop-yarn-nodemanager] [16:46:21] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 23 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[hadoop-yarn-nodemanager] [16:46:22] (03Merged) 10jenkins-bot: scap clean: Use list to delete, rather than list to keep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350592 (owner: 10Chad) [16:46:30] (03CR) 10jenkins-bot: scap clean: Use list to delete, rather than list to keep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350592 (owner: 10Chad) [16:46:31] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[hadoop-yarn-nodemanager] [16:46:52] these are mine, removed downtime [16:47:20] weird since yarn is running [16:47:52] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:48:10] (03PS2) 10Andrew Bogott: Revert "openstack: Temporarily disable self service instance creation and deletion on horizon" [puppet] - 10https://gerrit.wikimedia.org/r/350603 [16:48:47] !log demon@naos Synchronized scap/plugins/clean.py: --keep-static is nice now. 
Also need a co-master sync (duration: 01m 28s) [16:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:00] 06Operations, 10ops-codfw: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3218335 (10RobH) a:05Papaul>03akosiaris So @akosiaris should review and approve of the racking layout, since he is the ganeti expert! Alex: On these new nodes, do you want the... [16:50:17] (03CR) 10Andrew Bogott: [C: 032] Revert "openstack: Temporarily disable self service instance creation and deletion on horizon" [puppet] - 10https://gerrit.wikimedia.org/r/350603 (owner: 10Andrew Bogott) [16:50:21] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:52:11] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:52:18] (03CR) 10Dzahn: "It changes it from the default behaviour in 2.14, but it doesn't change anything about our _current_ setup. I think it's misleading that w" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [16:53:06] !log unbanning all elasticsearch servers in eqiad row D - T148506 [16:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:14] T148506: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506 [16:53:52] (03CR) 10Paladox: [C: 031] "> It changes it from the default behaviour in 2.14, but it doesn't" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [16:53:59] 06Operations, 10ops-codfw, 10DBA: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3218359 (10Papaul) Main board replacement DIMM B4 Replaced DIMM A1 Replaced BIOS update from 2.2.5 to 2.4.3 [16:54:10] 06Operations, 10ops-codfw, 10DBA: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3218360 (10Papaul) a:05Papaul>03Marostegui [16:55:19] (03CR) 10Dzahn: "i stick with it. "This changes it from using your ldap password" is not what this change does to our current Gerrit setup." [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [16:55:45] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.31 and port 8000: No route to host [16:56:14] XioNoX: ---^ [16:56:20] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw1210.eqiad.wmnet because of too many down!: ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down! [16:56:30] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:56:33] (03CR) 10Volans: "Last run of puppet compiler available at:" [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [16:56:37] what's up, that's not good if ocg1002 dies too [16:56:42] D3 has some ocg hosts [16:56:42] while ocg1001 is still in repair [16:56:42] D3: test - ignore - https://phabricator.wikimedia.org/D3 [16:56:52] ah [16:57:01] mutante: network maintenance for the rack with ocg100[23] [16:57:07] but it should have been done a while ago [16:57:10] elukey: gotcha, thanks [16:57:57] (03PS1) 10Chad: scap clean: Swap --keep-static for --delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350605 [16:58:26] elukey: what's the issue? 
[16:59:10] ocg.svc.eqiad.wmnet seems to be down due to hosts failing health checks (ocg100[23] that were in D3) [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170427T1700). Please do the needful. [17:00:15] ORES has one deploy [17:00:29] I'll check as well ocg / lvs [17:00:29] no parsoid deploy today [17:00:38] * halfak is monitoring [17:01:23] (03CR) 10Thcipriani: [C: 031] scap clean: Swap --keep-static for --delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350605 (owner: 10Chad) [17:01:50] (03CR) 10Chad: [C: 032] scap clean: Swap --keep-static for --delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350605 (owner: 10Chad) [17:02:14] (03PS6) 10Volans: Replace $::main_ipaddress by the new ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [17:02:50] (03Merged) 10jenkins-bot: scap clean: Swap --keep-static for --delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350605 (owner: 10Chad) [17:02:52] (03CR) 10Volans: "I've restored patch set 2" [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [17:03:03] (03CR) 10jenkins-bot: scap clean: Swap --keep-static for --delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350605 (owner: 10Chad) [17:04:12] (03PS18) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [17:04:25] (03CR) 10jerkins-bot: [V: 04-1] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 (owner: 10Gehel) [17:04:32] !log demon@naos Synchronized scap/plugins/clean.py: One last fix (duration: 01m 04s) [17:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:55] (03PS15) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [17:05:00] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw1237.eqiad.wmnet because of too many down!: ocg_8000 - Could not depool server ocg1003.eqiad.wmnet because of too many down! [17:05:08] (03CR) 10jerkins-bot: [V: 04-1] maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [17:05:19] 06Operations, 06Labs, 13Patch-For-Review: Ensure we can survive a loss of labservices1001 - https://phabricator.wikimedia.org/T163402#3218381 (10Andrew) a:05Andrew>03None The particular maintenance that prompted this task is now complete. We still need to improve here, though. 
[17:08:30] !log demon@naos Pruned MediaWiki: 1.29.0-wmf.16 (duration: 00m 13s) [17:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:55] <_joe_> !log stop pybal on lvs1006 to stop announcing via BGP [17:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:49] (03PS2) 10Addshore: wmgUseGettingStarted true for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347820 [17:11:50] PROBLEM - pybal on lvs1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [17:12:17] !log demon@naos Pruned MediaWiki: 1.29.0-wmf.18 [keeping static files] (duration: 00m 20s) [17:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:03] !log ladsgroup@naos:/srv/deployment/ores/deploy$ scap deploy (T163950) [17:15:06] !log ladsgroup@naos Started deploy [ores/deploy@68cca85]: (no justification provided) [17:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:11] T163950: Investigate failed deploy to CODFW - https://phabricator.wikimedia.org/T163950 [17:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:14] (03PS3) 10Andrew Bogott: Designate: Allow labs clients to access the designate API. [puppet] - 10https://gerrit.wikimedia.org/r/349531 (https://phabricator.wikimedia.org/T45580) [17:21:13] (03PS11) 10Paladox: Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 [17:23:26] (03CR) 10Andrew Bogott: [C: 032] Designate: Allow labs clients to access the designate API. [puppet] - 10https://gerrit.wikimedia.org/r/349531 (https://phabricator.wikimedia.org/T45580) (owner: 10Andrew Bogott) [17:27:34] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 467 bytes in 0.074 second response time [17:30:08] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:30:29] (03PS1) 10Andrew Bogott: Designate api: Fix typo in a ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/350608 [17:30:47] <_joe_> !log started pybal on lvs1006 after network was fixed [17:30:48] RECOVERY - pybal on lvs1006 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [17:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:23] (03CR) 10Andrew Bogott: [C: 032] Designate api: Fix typo in a ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/350608 (owner: 10Andrew Bogott) [17:32:28] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [17:36:36] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3218491 (10ayounsi) All servers have been moved, confirmed no more interfaces are up on asw. and no more traffic (other than multicast) on the as... 
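A note for anyone following the long-running gitBasicAuthPolicy review above: the knob being argued over is the auth.gitBasicAuthPolicy key in gerrit.config, which controls whether git-over-HTTP authenticates against the LDAP password, the generated HTTP password, or both (the HTTP_LDAP value Paladox mentions). Since gerrit.config is plain git-config format, it can be inspected and set as below; the path is a hypothetical example, not necessarily where the WMF install keeps it:

```
# gerrit.config path below is an assumption for illustration only.
GERRIT_CONFIG=/var/lib/gerrit/etc/gerrit.config
# Read the current policy; empty output means the version default applies.
git config -f "$GERRIT_CONFIG" auth.gitBasicAuthPolicy
# What the patch under review sets: authenticate git HTTP traffic against
# the per-user generated HTTP password instead of the LDAP one.
git config -f "$GERRIT_CONFIG" auth.gitBasicAuthPolicy HTTP
```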
[17:36:56] !log ladsgroup@naos Finished deploy [ores/deploy@68cca85]: (no justification provided) (duration: 21m 50s) [17:36:57] (03PS2) 10Legoktm: Create view for "linter" table on Labs [puppet] - 10https://gerrit.wikimedia.org/r/348201 (https://phabricator.wikimedia.org/T160611) [17:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:12] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3218493 (10ayounsi) [17:38:09] (03PS3) 10Andrew Bogott: Create view for "linter" table on Labs [puppet] - 10https://gerrit.wikimedia.org/r/348201 (https://phabricator.wikimedia.org/T160611) (owner: 10Legoktm) [17:39:30] (03CR) 10Dzahn: "alright, we just talked about this on IRC for a while. summary: Paladox is right that the docs say what he says, they claim for our curren" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [17:43:21] (03CR) 10Andrew Bogott: [C: 032] Create view for "linter" table on Labs [puppet] - 10https://gerrit.wikimedia.org/r/348201 (https://phabricator.wikimedia.org/T160611) (owner: 10Legoktm) [17:43:54] thanks andrewbogott :) [17:45:45] legoktm: still a bit to do to get the change active [17:45:53] ok [17:49:55] legoktm: which requires the dbas who might not be available until tomorrow. I'll make a note on the bug [17:50:11] ok, I'm not really in a rush [17:50:31] (03PS7) 10Dzahn: releases: remove the precise suite [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis) [17:54:38] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3218617 (10Cmjohnson) The disk has been replaced [17:55:17] 06Operations, 13Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3218630 (10Cmjohnson) @Dzahn yes, the gateway had a typo...all is well [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170427T1800). [18:00:04] Amir1 and addshore: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:12] o/ [18:00:15] \o [18:00:16] \o [18:01:06] Right, so, it looks like I'm doing this! [18:01:27] (03CR) 10Addshore: [C: 032] Fix echoIcon for wikibase in testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350584 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [18:01:29] (03PS2) 10Addshore: Fix echoIcon for wikibase in testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350584 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [18:01:36] (03CR) 10Addshore: [C: 032] Fix echoIcon for wikibase in testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350584 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [18:02:12] Amir1: is https://gerrit.wikimedia.org/r/#/c/350585/1 in master? [18:02:16] addshore: going to +2 mine now, as they depend on each other and gerrit can take awhile sometimes [18:02:37] ebernhardson: go ahead! :) [18:02:45] Amir1: Ahh yes, I see it! 
[18:02:54] addshore: it's in wmf.21 the master one is (I was looking for it) [18:02:57] :D [18:03:39] 06Operations, 13Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3218672 (10Dzahn) a:05Cmjohnson>03Dzahn cool, thank you. taking it back [18:04:09] (03PS1) 10Ladsgroup: Revert "cache::misc: switch ores from codfw to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/350622 [18:04:19] (03Merged) 10jenkins-bot: Fix echoIcon for wikibase in testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350584 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [18:04:25] (03CR) 10Ladsgroup: "Per my tests it's fixed now." [puppet] - 10https://gerrit.wikimedia.org/r/350622 (owner: 10Ladsgroup) [18:04:34] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3218678 (10Dzahn) a:05Cmjohnson>03Dzahn Thanks, i will do a reinstall later today. [18:04:35] mutante: hey, https://gerrit.wikimedia.org/r/#/c/350622/1 [18:05:12] Amir1: is the config patch okay to go straight out or do you want it on a debug server? [18:05:34] addshore: straight to all, it's for test wikis only and it's super hard to test [18:06:14] Amir1: syncing [18:06:38] 06Operations, 10ops-codfw, 10DBA: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3218689 (10Marostegui) Thanks @papaul! Let's see how it goes. [18:07:37] !log addshore@naos Synchronized wmf-config/Wikibase-production.php: SWAT [[gerrit:350584|Fix echoIcon for wikibase in testwikis]] (duration: 01m 27s) [18:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:46] Amir1: ^^ [18:07:54] 06Operations, 10Deployment-Systems, 13Patch-For-Review, 10Scap (Scap3-MediaWiki-MVP): Install conftool on deployment masters - https://phabricator.wikimedia.org/T163565#3218694 (10demon) So, our original usecase for conftool was {T125629} -- we shouldn't have to worry much about depooling the whole cluster... [18:07:56] Thanks [18:07:58] I'll check it [18:09:21] ebernhardson I'll do yours next (and come back for Amir1's core patch after) [18:10:41] 06Operations, 13Patch-For-Review: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3218722 (10Volans) [18:12:05] (03CR) 10jenkins-bot: Fix echoIcon for wikibase in testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350584 (https://phabricator.wikimedia.org/T142102) (owner: 10Ladsgroup) [18:12:08] ebernhardson: are you happy for both of your patches to go out together? [18:12:09] Amir1: a little bit of a bad moment for me to do that right now.. have to exit a train in 5 min, ok in about an hour? [18:12:19] addshore: yes they should go together [18:12:19] mutante: sure [18:12:23] Amir1: also, i learned i was supposed to do that in 2 steps [18:12:29] ebernhardson: ack! :) [18:12:51] addshore: it won't break anything, but one adds a way to adjust things, and the second uses that for a specific use case [18:12:54] Amir1: ..
to avoid 50x during switchover [18:13:10] la la la *waits for jenkins* [18:15:06] Okay, they are all in :) [18:17:37] Amir1: ebernhardson your patches are on mwdebug1002 [18:17:52] !log restarting apache on iridium to hotfix T164005 [18:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:08] addshore: checking [18:18:37] works like a charm [18:18:45] Amir1: ack, will sync yours now [18:19:24] answers in 935 ms, while in ordinary mode it answers in 40 seconds [18:19:46] Amir1: syncing [18:20:39] !log addshore@naos Synchronized php-1.29.0-wmf.21/includes/api/ApiQueryPagePropNames.php: SWAT [[gerrit:350585|Do not add limit to ApiQueryPagePropNames when database type is mysql]] (duration: 01m 04s) [18:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:03] It seems that stashbot isn't showing the tasks when someone links to T [18:21:25] works as expected everywhere [18:21:28] addshore: works fine i think but looks like i used the wrong profile name in the config patch that enables this (already shipped), one sec i'll have a wmf-config patch [18:21:44] ebernhardson: ack! [18:21:56] !log T163936: restarting cassandra-metrics-collector, restbase staging [18:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:04] T163936: Latency metrics missing - https://phabricator.wikimedia.org/T163936 [18:22:53] !log addshore@naos Synchronized php-1.29.0-wmf.21/extensions/WikimediaEvents/extension.json: SWAT [[gerrit:350544|WMDE Spring campaign - Remove hook]] PT1/2 (duration: 00m 57s) [18:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:21] (03PS1) 10EBernhardson: update name of sistersearch profile for wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350626 [18:23:25] addshore: ^ [18:23:41] ebernhardson: okay, just pushing out the second part of this other patch [18:23:43] !log T163936: restarting cassandra-metrics-collector, restbase production [18:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:43] !log addshore@naos Synchronized php-1.29.0-wmf.21/extensions/WikimediaEvents/WikimediaEventsHooks.php: SWAT [[gerrit:350544|WMDE Spring campaign - Remove hook]] PT2/2 (duration: 00m 52s) [18:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:20] (03CR) 10Addshore: [C: 032] update name of sistersearch profile for wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350626 (owner: 10EBernhardson) [18:25:34] ebernhardson: could you add that config patch to the swat calendar too please? :) [18:25:39] addshore: sure [18:26:43] addshore: lemme know when it's on mwdebug1002 [18:26:56] (03Merged) 10jenkins-bot: update name of sistersearch profile for wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350626 (owner: 10EBernhardson) [18:27:05] (03CR) 10jenkins-bot: update name of sistersearch profile for wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350626 (owner: 10EBernhardson) [18:28:02] ebernhardson: it is there now [18:29:31] addshore: looks good, thanks!
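The verify-on-mwdebug ritual in the exchange above follows the standard SWAT pattern. Roughly, and with the debug-header value given as an assumption rather than gospel:

```
# Sketch of the SWAT flow seen above; the X-Wikimedia-Debug value is an
# assumption, check the current deployment docs for the exact form.
# 1. Stage the change on the debug host:
ssh mwdebug1002.eqiad.wmnet 'scap pull'
# 2. Send a test request pinned to that host:
curl -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
    'https://test.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json'
# 3. Once verified, sync a single file fleet-wide, as in the !log entries:
scap sync-file wmf-config/InitialiseSettings.php 'SWAT: [[gerrit:NNNNNN|example]]'
```

(The NNNNNN is deliberately a placeholder, not a real change number.)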
[18:29:44] okay, syncing the config change now then will sync the rest :) [18:30:42] sounds good [18:31:09] !log addshore@naos Synchronized wmf-config/CirrusSearch-common.php: SWAT [[gerrit:350626|update name of sistersearch profile for wikivoyage]] (duration: 00m 49s) [18:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:36] syncing [18:34:09] !log addshore@naos Synchronized php-1.29.0-wmf.21/extensions/CirrusSearch: SWAT [[gerrit:350614|#1]] [[gerrit:350615|#2]] (duration: 00m 59s) [18:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:34] ebernhardson: ^^ please check! [18:36:32] addshore: seems good, thanks! [18:36:44] Awesome! right, moving onto my patches! [18:37:15] (03PS2) 10Addshore: Enable Cognate Logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350140 [18:37:17] (03CR) 10Addshore: [C: 032] Enable Cognate Logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350140 (owner: 10Addshore) [18:37:30] (03PS3) 10Addshore: wmgUseGettingStarted true for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347820 [18:37:41] (03PS2) 10Addshore: WMDE Spring campaign - Remove logging (no longer needed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347818 [18:37:55] (03CR) 10Addshore: [C: 032] Enable Cognate Logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350140 (owner: 10Addshore) [18:40:27] (03Merged) 10jenkins-bot: Enable Cognate Logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350140 (owner: 10Addshore) [18:40:36] (03CR) 10jenkins-bot: Enable Cognate Logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350140 (owner: 10Addshore) [18:40:53] (03CR) 10Addshore: [C: 032] wmgUseGettingStarted true for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347820 (owner: 10Addshore) [18:41:34] !log addshore@naos Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:350140|Enable Cognate Logging]] (duration: 00m 48s) [18:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:25] (03Merged) 10jenkins-bot: wmgUseGettingStarted true for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347820 (owner: 10Addshore) [18:42:41] (03CR) 10jenkins-bot: wmgUseGettingStarted true for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347820 (owner: 10Addshore) [18:42:55] (03PS1) 10Eevans: Link-in upgraded cassandra-metrics-collector jar [puppet] - 10https://gerrit.wikimedia.org/r/350632 (https://phabricator.wikimedia.org/T163936) [18:43:29] (03CR) 10Addshore: [C: 032] WMDE Spring campaign - Remove logging (no longer needed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347818 (owner: 10Addshore) [18:44:05] !log addshore@naos Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:347820|wmgUseGettingStarted true for dewiki]] (duration: 00m 48s) [18:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:31] (03CR) 10Eevans: [C: 04-1] "Not yet; Not before https://gerrit.wikimedia.org/r/350503 (and the corresponding deploy)" [puppet] - 10https://gerrit.wikimedia.org/r/350632 (https://phabricator.wikimedia.org/T163936) (owner: 10Eevans) [18:45:02] (03Merged) 10jenkins-bot: WMDE Spring campaign - Remove logging (no longer needed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347818 (owner: 10Addshore) [18:45:11] (03CR) 10jenkins-bot: WMDE Spring campaign - Remove logging (no longer needed) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/347818 (owner: 10Addshore) [18:46:31] !log addshore@naos Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:347818|WMDE Spring campaign - Remove logging (no longer needed)]] (duration: 00m 47s) [18:46:34] Amir1: i'm wondering if we can just make it active/active and have both enabled at the same time [18:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:55] Amir1: since i'm supposed to do that first for a short time.. and then switch [18:47:01] mutante: AFAIK it was the plan [18:47:34] !log Morning SWAT Done! [18:47:37] https://wikitech.wikimedia.org/wiki/Switch_Datacenter [18:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:54] "ores (active/active)" [18:48:10] Amir1: ok :) [18:48:19] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218833 (10Cmjohnson) is there anything I need to be doing for this? [18:49:10] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218835 (10Marostegui) Do you have any spare BBU available? [18:53:36] (03PS1) 10EBernhardson: sistsearch title-filter profile for wikivoyage as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350637 [18:55:07] (03PS2) 10EBernhardson: enable sistersearch title filter profile for wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350637 [18:57:30] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3218861 (10Cmjohnson) I've connected the ripe atlas via usb on iron to console port on the ripe-atlas [18:58:31] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [18:59:31] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170427T1900). Please do the needful. [19:03:53] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3218884 (10Cmjohnson) The ssd has been received and swapped. @Marostegui or someone else please fix the raid cfg and resolve. Thanks Return shipping number is USPS 9202 3946 5301 2435 4073 66 FEDEX (961191... [19:04:33] 06Operations, 10ops-eqiad, 15User-fgiunchedi: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#3218886 (10Cmjohnson) Returning the part they sent and need to contact HP that it did not work and have them send a new part or a tech to look at it. [19:05:12] 06Operations, 10ops-eqiad, 15User-fgiunchedi: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#3218888 (10Cmjohnson) return shipping tracking UPS 1Z 422 2AR 90 5200 6397 [19:06:06] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3218890 (10Cmjohnson) @Marostegui yes, I can use one from a decommissioned server. 
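On the db1048 BBU thread: the battery state marostegui pasted into the task earlier comes straight from MegaCli, and a failed BBU is what silently flips the controller cache from WriteBack to WriteThrough (the "Current Cache Policy: WriteThrough" spotted above), which in turn explains the replication lag. The checks, reusing the command already quoted in the task:

```
# Battery status, as quoted in T160731 above:
megacli -AdpBbuCmd -a0
# Cache policy per logical drive; a faulty BBU typically forces
# WriteBack -> WriteThrough, which tanks write latency:
megacli -LDInfo -LALL -a0 | grep -i 'cache policy'
```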
[19:08:39] mutante: I'm heading to bed, see you tomorrow [19:08:40] o/ [19:08:51] PROBLEM - Check whether ferm is active by checking the default input chain on kafka1018 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [19:09:11] PROBLEM - Check whether ferm is active by checking the default input chain on kafka1020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [19:09:11] PROBLEM - Check systemd state on kafka1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:09:31] PROBLEM - Check systemd state on kafka1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:09:46] jouncebot: now [19:09:47] For the next 1 hour(s) and 50 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170427T1900) [19:09:49] jouncebot: next [19:09:49] In 3 hour(s) and 50 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170427T2300) [19:14:45] (03CR) 10RobH: "I wouldn't break from the procedure, and I would leave the hostname management in place until the system is unracked." [dns] - 10https://gerrit.wikimedia.org/r/350113 (https://phabricator.wikimedia.org/T162952) (owner: 10Dzahn) [19:15:38] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T163690#3218898 (10Cmjohnson) A case has been opened with HP Your case was successfully submitted. Please note your Case ID: 5319274490 for future reference. [19:18:01] 06Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#3218900 (10Cmjohnson) What am I supposed to be doing with this task? 
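The kafka1018/kafka1020 ferm alerts above are the same failure mode as the analytics hosts earlier in the day; the remedy elukey used then, plus the reset-failed step volans hinted at, amounts to:

```
# Start ferm again and clear the remembered unit failure so the
# "Check systemd state" probe stops reporting "degraded".
systemctl start ferm
systemctl reset-failed ferm.service
systemctl is-system-running   # should print "running" rather than "degraded"
```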
[19:18:48] 06Operations, 10ops-eqiad: ocg1001.eqiad.wmnet ipmi error - https://phabricator.wikimedia.org/T155692#3218902 (10Cmjohnson) @dzahn can you check this once it's reinstalled [19:20:10] (03PS1) 10Jgreen: move fundraisingdb read traffic off of frdb1002 to switch its port [dns] - 10https://gerrit.wikimedia.org/r/350642 [19:20:49] (03CR) 10Jgreen: [C: 032] move fundraisingdb read traffic off of frdb1002 to switch its port [dns] - 10https://gerrit.wikimedia.org/r/350642 (owner: 10Jgreen) [19:24:10] !log reedy@naos Synchronized wmf-config/CommonSettings.php: Run pdf processors in firejails T164000 (duration: 01m 20s) [19:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:31] (03PS1) 10Reedy: Run Pdf Processors in firejails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350643 (https://phabricator.wikimedia.org/T164000) [19:25:07] (03CR) 10Reedy: [C: 032] Run Pdf Processors in firejails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350643 (https://phabricator.wikimedia.org/T164000) (owner: 10Reedy) [19:26:29] (03Merged) 10jenkins-bot: Run Pdf Processors in firejails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350643 (https://phabricator.wikimedia.org/T164000) (owner: 10Reedy) [19:26:37] (03CR) 10jenkins-bot: Run Pdf Processors in firejails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350643 (https://phabricator.wikimedia.org/T164000) (owner: 10Reedy) [19:28:53] 06Operations, 06Release-Engineering-Team, 07Security-General: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030#3218909 (10RobH) [19:29:16] 06Operations, 06Release-Engineering-Team, 10vm-requests, 07Security-General: New ganeti VM for MW release pipeline work - https://phabricator.wikimedia.org/T163743#3207976 (10RobH) 05Open>03Resolved Request granted, no objections, and I've created T164030 for setup. [19:29:50] (03PS1) 10RobH: mwreleases1001.eqiad.wmnet production dns entries [dns] - 10https://gerrit.wikimedia.org/r/350645 [19:30:28] (03CR) 10RobH: [C: 032] mwreleases1001.eqiad.wmnet production dns entries [dns] - 10https://gerrit.wikimedia.org/r/350645 (owner: 10RobH) [19:30:55] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: move frdb1002 from pfw1 to pfw2 - https://phabricator.wikimedia.org/T163268#3218937 (10Cmjohnson) 05Open>03Resolved This has been fixed....confirmed working via IRC w/Jeff cmjohnson1 okay done Jeff_Green he shut off 11/0/14 Jeff_Green it's responding to... [19:31:01] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 108, down: 1, dormant: 0, excluded: 3, unused: 0; ge-11/0/2: down - frdb1002 [19:33:52] ottomata: around?
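For context on the "Run Pdf Processors in firejails" change above: a minimal sketch of the kind of config it implies, assuming PdfHandler's $wgPdfProcessor/$wgPdfPostProcessor/$wgPdfInfo globals. The patch contents are not shown in the log, so the wrapper paths and profile below are illustrative assumptions, not the deployed change (350643).
```php
// Hypothetical sketch only -- not the contents of Gerrit change 350643.
// PdfHandler shells out to these binaries to render and inspect PDFs;
// pointing the globals at small wrapper scripts that exec
// "firejail --profile=... gs|convert|pdfinfo" confines the renderers to a
// sandbox without touching the extension code. All paths are made up.
$wgPdfProcessor     = '/usr/local/bin/firejail-gs';       // wraps ghostscript
$wgPdfPostProcessor = '/usr/local/bin/firejail-convert';  // wraps ImageMagick
$wgPdfInfo          = '/usr/local/bin/firejail-pdfinfo';  // wraps pdfinfo
```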
[19:33:52] !log start mediawiki deployment train group 2 - all wikis to 1.29.0-wmf.21 [19:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:28] (03PS1) 10Jgreen: move fundraisingdb-read traffic back to frdb1002 [dns] - 10https://gerrit.wikimedia.org/r/350648 [19:34:51] (03PS1) 1020after4: all wikis to 1.29.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350649 [19:34:53] (03CR) 1020after4: [C: 032] all wikis to 1.29.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350649 (owner: 1020after4) [19:35:08] (03CR) 10Jgreen: [C: 032] move fundraisingdb-read traffic back to frdb1002 [dns] - 10https://gerrit.wikimedia.org/r/350648 (owner: 10Jgreen) [19:37:06] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350649 (owner: 1020after4) [19:37:14] cmjohnson1: ya hey [19:37:15] (03CR) 10jenkins-bot: all wikis to 1.29.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350649 (owner: 1020after4) [19:37:46] hey ottomata [19:37:49] 06Operations, 06Release-Engineering-Team, 13Patch-For-Review, 07Security-General: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030#3218946 (10RobH) [19:37:58] an1030 can I power off and reset the idrac [19:38:21] https://phabricator.wikimedia.org/T162046 [19:41:19] mutante: feel free to merge https://gerrit.wikimedia.org/r/345838 [19:41:47] 06Operations, 10hardware-requests, 13Patch-For-Review, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3218962 (10Cmjohnson) p:05Normal>03Low [19:43:38] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3218967 (10Cmjohnson) [19:43:42] 06Operations, 10ops-eqiad: es1019.eqiad.wmnet drac unresponsive - https://phabricator.wikimedia.org/T155691#3218965 (10Cmjohnson) 05Open>03Resolved @volans this was corrected when we did the rack move root@es1019.mgmt.eqiad.wmnet's password: /admin1-> [19:43:44] paravoid: last moment i found it's technically one more day, heh. but i'll do it very soon [19:44:05] just arrived at office and reverting the ORES switch to eqiad [19:44:05] ottomata: analytics1030: can I power off? [19:45:32] 06Operations, 10ops-eqiad, 10Phabricator: phab1001 hdd port a failure - https://phabricator.wikimedia.org/T163960#3218977 (10Cmjohnson) I verified in bios that port A and port B do see the 2 disk drives, I did power off and back again. @robh: can you please paste log or output from install so I can provide p...
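On the train mechanics above: change 350649 edits the per-wiki version map in operations/mediawiki-config, and scap later compiles it into wikiversions.php (the "rebuilt wikiversions.php" log entry further down). A rough sketch of that compiled map, with illustrative wiki entries:
```php
<?php
// Sketch of the compiled wikiversions.php map; entries are illustrative.
// The app servers consult this dbname => version map on every request to
// pick which MediaWiki checkout (/srv/mediawiki/php-<version>) serves a
// given wiki; the weekly train just flips entries group by group.
return [
	'aawiki' => 'php-1.29.0-wmf.21',
	'dewiki' => 'php-1.29.0-wmf.21', // group2 ("all wikis") moves last
	'enwiki' => 'php-1.29.0-wmf.21',
	// ... one entry per production wiki ...
];
```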
[19:46:08] 06Operations, 10ops-eqiad: apply new hostname label for wmf4747/phab1001 - https://phabricator.wikimedia.org/T163940#3218980 (10Cmjohnson) 05Open>03Resolved this has been done...racktables updated [19:46:11] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3218982 (10Cmjohnson) [19:46:56] 06Operations, 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3218993 (10Dzahn) [19:47:08] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/350652/ [19:47:17] 06Operations, 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3181286 (10Dzahn) added decom checklist template from https://wikitech.wikimedia.org/wiki/Server_Lifecycle/reclaim_checklist [19:47:34] 06Operations, 10ops-eqiad, 10Analytics, 06DC-Ops, 13Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3218998 (10Cmjohnson) p:05Normal>03Low [19:47:34] just remove the type hint? :D [19:48:36] Yeah [19:48:49] For the other hook... [19:48:50] 'SpecialNewPagesFilters': Called after building form options at NewPages. [19:48:50] $special: the special page object [19:49:38] 06Operations, 10ops-eqiad, 15User-fgiunchedi: upgrade memory in prometheus100[34] - https://phabricator.wikimedia.org/T163385#3219004 (10Cmjohnson) @fgiunchedi Can you do Friday 4/28 @10am EST? or Monday same time? [19:49:44] I wonder if both hooks should still be registered.. But meh [19:51:30] (03PS1) 10Dzahn: cache::misc/ores: make ores active/active temporarily [puppet] - 10https://gerrit.wikimedia.org/r/350655 [19:52:04] 06Operations, 10ops-eqiad, 10Phabricator: phab1001 hdd port a failure - https://phabricator.wikimedia.org/T163960#3219012 (10RobH) Yep, attached the screen shot, it shows sata port failure A on every boot. {F7796566} [19:53:14] Amir1: ^ see commit message of that, i'm reverting it to how it was before, but first active/active for a moment, "soft switch" [19:53:41] RECOVERY - MegaRAID on ms-be1006 is OK: OK: optimal, 13 logical, 13 physical [19:55:15] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1006 - https://phabricator.wikimedia.org/T162347#3219051 (10Cmjohnson) 05Open>03Resolved Replaced the disk [19:55:37] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Decommission db1057 - https://phabricator.wikimedia.org/T162135#3219053 (10Cmjohnson) p:05Normal>03Low [19:56:50] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Spread eqiad analytics Kafka nodes to multiple racks and rows - https://phabricator.wikimedia.org/T163002#3219063 (10Cmjohnson) @Ottomata @elukey Do you still want to spread out the nodes or okay to resolve this task? [19:57:23] 06Operations, 10ops-eqiad: es1019.eqiad.wmnet drac unresponsive - https://phabricator.wikimedia.org/T155691#3219073 (10Volans) @Cmjohnson: great! Thanks a lot! [19:58:26] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Spread eqiad analytics Kafka nodes to multiple racks and rows - https://phabricator.wikimedia.org/T163002#3219076 (10Ottomata) I'd like to hear what elukey thinks, but I think we can resolve this. This Kafka hardware is slated to be decommed anyway, bu...
[19:59:46] !log twentyafterfour@naos Synchronized php-1.29.0-wmf.21/extensions/FlaggedRevs/frontend/FlaggedRevsUI.hooks.php: deploy fix for T163994 (duration: 01m 04s) [19:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:55] T163994: Argument 1 passed to FlaggedRevsUIHooks::addHideReviewedFilter() must be an instance of ChangesListSpecialPage - https://phabricator.wikimedia.org/T163994 [20:01:03] nope... it just dies later [20:01:13] Fatal error: Call to undefined method SpecialNewpages::registerFilterGroup() in /srv/mediawiki/php-1.29.0-wmf.21/extensions/FlaggedRevs/frontend/FlaggedRevsUI.hooks.php on line 316 [20:01:31] PROBLEM - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:33] !log ocg1001 - reboot into PXE, re-install [20:01:33] ok I guess the train is halted. [20:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:57] matt_flaschen: ^^^ [20:02:06] I wonder if both hooks are needed [20:02:49] 06Operations, 10ops-eqdfw, 10Analytics, 06DC-Ops: SATA errors for stat1004 in the dmesg - https://phabricator.wikimedia.org/T162770#3219095 (10Cmjohnson) @elukey That is not a disk issue....What mode do you want the server to be in? Raid mode or AHCI? [20:02:51] RECOVERY - Host ocg1001 is UP: PING OK - Packet loss = 0%, RTA = 37.26 ms [20:03:04] !log 1.29.0-wmf.21 is blocked by T163994 [20:03:06] Reedy, whoops, on it. [20:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:18] (03CR) 10Dzahn: [C: 032] cache::misc/ores: make ores active/active temporarily [puppet] - 10https://gerrit.wikimedia.org/r/350655 (owner: 10Dzahn) [20:08:38] 06Operations, 10ops-eqiad: Decommission wmf3096 - https://phabricator.wikimedia.org/T147860#3219125 (10RobH) I think if it's out of warranty more than 6 months, we should decommission it from spares. We won't want to spin up any new services or hosts on out of warranty hardware. [20:10:21] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static [20:10:21] PROBLEM - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static [20:11:00] andrewbogott: ^ wt-static reinstall? [20:11:14] yeah, I'm upgrading things, php is broken for the moment [20:11:21] gotcha, 'k [20:11:40] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3219137 (10Cmjohnson) [20:11:44] 06Operations, 10ops-eqiad, 10DBA: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3219135 (10Cmjohnson) 05Open>03Resolved Reset the idrac and it appears that db1070 is now accessible from ipmi tool cmjohnson@db1070:~$ sudo ipmi-chassis --get-chassis-status System Power...
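The fatal above is the failure mode being discussed at 19:47-19:49: FlaggedRevs registers the same handler for both the structured-filters hook and 'SpecialNewPagesFilters', but only ChangesListSpecialPage subclasses (RecentChanges, Watchlist) have registerFilterGroup(), while Special:NewPages passes a plain special page. A sketch of the kind of guard implied by "just remove the type hint", not the actual patch (350664 below); the helper name is hypothetical:
```php
<?php
// Illustrative sketch only -- not the deployed FlaggedRevs fix.
class FlaggedRevsUIHooksSketch {
	// Type hint deliberately dropped: 'SpecialNewPagesFilters' passes a
	// SpecialNewpages, not a ChangesListSpecialPage, and a mismatched hint
	// is itself a fatal.
	public static function addHideReviewedFilter( $special ) {
		if ( !( $special instanceof ChangesListSpecialPage ) ) {
			// No structured-filters API on this page; nothing to register.
			return true;
		}
		// buildHideReviewedFilterGroup() is a hypothetical helper standing
		// in for the real filter-group construction.
		$special->registerFilterGroup( self::buildHideReviewedFilterGroup() );
		return true;
	}
}
```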
[20:12:32] 06Operations, 13Patch-For-Review, 15User-Elukey: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#3219139 (10Cmjohnson) Removing ops-eqiad [20:13:21] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (70126 200000s) [20:13:21] RECOVERY - are wikitech and wt-static in sync on labtestweb2001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (70126 200000s) [20:15:15] !log run puppet on cache::misc to push ores change - cumin -b 5 -s 10 'R:class = role::cache::misc' 'run-puppet-agent -q' [20:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:35] (03PS2) 10Kaldari: Enable cookie blocking on all remaining production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350090 (https://phabricator.wikimedia.org/T162651) [20:16:50] !log ocg1001 - revoke old puppet cert, salt key [20:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:25] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Some swift disks wrongly mounted on 5 ms-be hosts - https://phabricator.wikimedia.org/T163673#3219168 (10Cmjohnson) @fgiunchedi Is there anything I can do to help with this? [20:18:32] !log ores is active/active now, for a short time [20:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:48] Reedy, https://gerrit.wikimedia.org/r/350662 [20:19:21] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static [20:19:21] PROBLEM - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static [20:19:45] 06Operations, 10ops-eqiad, 10DBA: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3219182 (10Cmjohnson) @Marostegui I will be in the data center Friday 4/27 at 0930. Let's get this taken care of right away. [20:20:38] !log ocg1001 - re-added to puppet, initial run, reinstall ongoing (T161158) [20:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:46] T161158: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158 [20:20:58] 06Operations, 10ops-eqiad, 10DBA: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3213695 (10jcrespo) Cmjohnson- we really appreciate the effort- we know these days you have lots and lots of work! [20:21:21] RECOVERY - are wikitech and wt-static in sync on labtestweb2001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (71182 200000s) [20:21:21] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (71182 200000s) [20:21:24] !log stripping a bunch of unneeded extensions from wikitech-static [20:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:33] 06Operations, 10ops-eqiad: hard-reset DRAC gadolinium.mgmt.eqiad.wmnet - https://phabricator.wikimedia.org/T158131#3219191 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Fixed this and enabled HT.
[20:23:13] twentyafterfour: revert in https://gerrit.wikimedia.org/r/#/c/350663/ and matt_flaschen's fix in https://gerrit.wikimedia.org/r/#/c/350664/ [20:23:39] thanks Reedy [20:25:11] (03PS2) 10Dzahn: Revert "cache::misc: switch ores from codfw to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/350622 (owner: 10Ladsgroup) [20:26:30] (03PS3) 10Dzahn: Revert "cache::misc: switch ores from codfw to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/350622 (owner: 10Ladsgroup) [20:26:34] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3219210 (10Cmjohnson) [20:26:36] 06Operations, 10ops-eqiad: cp1066 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T163889#3219208 (10Cmjohnson) 05Open>03Resolved the mgmt cable was not seated correctly. Fixed that issue, I can login via mgmt and ipmi is working cmjohnson@cp1066:~$ sudo ipmi-chassis --get-chassis-status Syste... [20:26:51] PROBLEM - Check size of conntrack table on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:26:51] PROBLEM - Check systemd state on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:27:01] PROBLEM - Disk space on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:27:01] PROBLEM - DPKG on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:27:01] PROBLEM - dhclient process on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:27:01] PROBLEM - configured eth on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:27:21] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static [20:27:21] PROBLEM - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static [20:27:39] (03CR) 10Dzahn: [C: 032] "halfak checked while it was active/active and things looked good. now going back to codfw-only as it was before the failed deploy yesterd" [puppet] - 10https://gerrit.wikimedia.org/r/350622 (owner: 10Ladsgroup) [20:27:52] (03PS2) 10Eevans: Update collector version to 3.1.4 [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/350503 (https://phabricator.wikimedia.org/T163936) [20:27:52] RECOVERY - Disk space on meitnerium is OK: DISK OK [20:27:52] RECOVERY - dhclient process on meitnerium is OK: PROCS OK: 0 processes with command name dhclient [20:27:52] RECOVERY - DPKG on meitnerium is OK: All packages OK [20:27:52] RECOVERY - configured eth on meitnerium is OK: OK - interfaces up [20:28:21] RECOVERY - are wikitech and wt-static in sync on labtestweb2001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (71225 200000s) [20:28:21] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (71225 200000s) [20:29:03] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3219234 (10Cmjohnson) @Marostegui Same thing as db1048? I can use a spare bbu from a decom server if you like or is this server nearing its last days? [20:29:07] 06Operations, 10ops-eqiad, 10DBA: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3219235 (10Marostegui) Thanks Chris!!! @jcrespo, this means reconfigure the slaves as the masters will change IPs...
[20:29:41] RECOVERY - Check size of conntrack table on meitnerium is OK: OK: nf_conntrack is 0 % full [20:29:41] RECOVERY - Check systemd state on meitnerium is OK: OK - running: The system is fully operational [20:31:41] matt_flaschen: your patch didn't pass the gate [20:31:47] https://integration.wikimedia.org/ci/job/mwgate-composer-hhvm-jessie/695/console [20:31:55] Short array syntax must be used to define arrays [20:32:03] Multiple empty lines should not exist in a row; [20:32:04] !log ores/cache::misc: switch ores back to codfw-only - everything is like it was before the failed deploy yesterday again [20:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:31] RECOVERY - configured eth on ocg1001 is OK: OK - interfaces up [20:33:31] RECOVERY - Check whether ferm is active by checking the default input chain on ocg1001 is OK: OK ferm input default policy is set [20:33:38] 06Operations, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3219263 (10Cmjohnson) @jcrespo and @Marostegui db1106 is racked, idrac/bios setup, switch cfg is done. dhcpd file is configured...ready for install [20:33:51] RECOVERY - Check size of conntrack table on ocg1001 is OK: OK: nf_conntrack is 0 % full [20:33:51] RECOVERY - DPKG on ocg1001 is OK: All packages OK [20:33:52] RECOVERY - Disk space on ocg1001 is OK: DISK OK [20:33:57] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3219268 (10Marostegui) [20:34:00] twentyafterfour, Reedy, fixed. [20:34:01] RECOVERY - dhclient process on ocg1001 is OK: PROCS OK: 0 processes with command name dhclient [20:34:01] RECOVERY - salt-minion processes on ocg1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:34:01] RECOVERY - MD RAID on ocg1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [20:34:21] PROBLEM - SSH on meitnerium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:34:41] PROBLEM - puppet last run on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:35:11] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [20:35:17] thanks [20:35:31] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:36:14] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3219273 (10Cmjohnson) [20:36:51] PROBLEM - Check whether ferm is active by checking the default input chain on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:36:51] PROBLEM - salt-minion processes on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:36:51] PROBLEM - Check size of conntrack table on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:36:51] PROBLEM - Check systemd state on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:37:33] !log ocg1001 - has been reinstalled but ocg package deployment fails currently "has the minion key been accepted", should not be repooled just yet [20:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:01] PROBLEM - DPKG on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:38:01] PROBLEM - Disk space on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
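The two complaints at 20:31:55 and 20:32:03 are phpcs sniffs from the MediaWiki coding-standards ruleset that the CI gate enforces. An illustrative before/after; these are not the actual lines from the patch:
```php
<?php
// Fails "Short array syntax must be used to define arrays":
$filters = array( 'hidereviewed' );

// Fails "Multiple empty lines should not exist in a row":
$a = 1;


$b = 2;

// Compliant equivalents:
$filters = [ 'hidereviewed' ];
$a = 1;

$b = 2;
```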
[20:38:01] PROBLEM - dhclient process on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:38:01] PROBLEM - configured eth on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:38:32] PROBLEM - SSH on meitnerium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:41] PROBLEM - puppet last run on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:38:47] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3219280 (10Cmjohnson) [20:38:49] 06Operations, 10ops-eqiad: ocg1001.eqiad.wmnet ipmi error - https://phabricator.wikimedia.org/T155692#3219278 (10Cmjohnson) 05Open>03Resolved idrac reset...I am now able to login via IPMI cmjohnson@ocg1001:~$ sudo ipmi-chassis --get-chassis-status System Power : on Power overload... [20:39:41] RECOVERY - salt-minion processes on meitnerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:39:41] RECOVERY - Check whether ferm is active by checking the default input chain on meitnerium is OK: OK ferm input default policy is set [20:40:31] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures [20:40:32] PROBLEM - Check the NTP synchronisation status of timesyncd on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:40:41] RECOVERY - Check size of conntrack table on meitnerium is OK: OK: nf_conntrack is 0 % full [20:40:41] RECOVERY - Check systemd state on meitnerium is OK: OK - running: The system is fully operational [20:40:51] RECOVERY - Disk space on meitnerium is OK: DISK OK [20:40:51] RECOVERY - dhclient process on meitnerium is OK: PROCS OK: 0 processes with command name dhclient [20:40:51] RECOVERY - DPKG on meitnerium is OK: All packages OK [20:40:51] RECOVERY - configured eth on meitnerium is OK: OK - interfaces up [20:41:13] 06Operations, 10ops-eqiad, 10Traffic: cp1066.mgmt.eqiad.wmnet is unreachable - https://phabricator.wikimedia.org/T149217#3219303 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson cable was not seated correctly. Fixed [20:41:47] 06Operations, 10ops-eqiad, 10DBA: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3219307 (10Marostegui) >>! In T163895#3219235, @Marostegui wrote: > Thanks Chris!!! > @jcrespo, this means reconfigure the slaves as the masters will change IPs... Nevermind, just realised we re... [20:41:58] 06Operations, 10Cassandra, 13Patch-For-Review, 06Services (blocked): setup/install restbase-dev100[123] - https://phabricator.wikimedia.org/T151075#3219308 (10Cmjohnson) [20:42:04] !log ppchelko@naos Started deploy [restbase/deploy@fcfc537]: Automatically rerender parsoid, only store summaries if they are changed [20:42:06] (03PS1) 10Abián: Fix colors to match style guide, Bug: T163048 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350697 (https://phabricator.wikimedia.org/T163048) [20:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:21] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [20:47:01] PROBLEM - configured eth on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:47:02] 06Operations, 10ops-eqiad: ms-be1017 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T148016#2712265 (10Cmjohnson) This is definitely a h/w bug. 
I will need to power the server off and check the connections of the usb ports. I am not 100% sure if I can disable it in bios. [20:47:31] !log twentyafterfour@naos Synchronized php-1.29.0-wmf.21/extensions/FlaggedRevs: deploy fix for T163994 (duration: 01m 17s) [20:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:41] T163994: Argument 1 passed to FlaggedRevsUIHooks::addHideReviewedFilter() must be an instance of ChangesListSpecialPage - https://phabricator.wikimedia.org/T163994 [20:48:51] PROBLEM - Check systemd state on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:48:52] PROBLEM - Check size of conntrack table on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:49:01] PROBLEM - Disk space on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:49:01] PROBLEM - DPKG on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:49:01] PROBLEM - dhclient process on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:49:06] 06Operations, 10ops-eqiad: ms-be1025 network down - https://phabricator.wikimedia.org/T148391#3219330 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson This hasn't happened since Oct 2016. Resolving this task. Should the error return, please reopen [20:49:41] RECOVERY - Check size of conntrack table on meitnerium is OK: OK: nf_conntrack is 0 % full [20:49:41] RECOVERY - Check systemd state on meitnerium is OK: OK - running: The system is fully operational [20:49:51] RECOVERY - Disk space on meitnerium is OK: DISK OK [20:49:51] RECOVERY - dhclient process on meitnerium is OK: PROCS OK: 0 processes with command name dhclient [20:49:51] RECOVERY - DPKG on meitnerium is OK: All packages OK [20:49:51] RECOVERY - configured eth on meitnerium is OK: OK - interfaces up [20:52:01] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:52:14] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 72745 msg: ocg_render_job_queue 0 msg [20:53:27] !log twentyafterfour@naos rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.21 [20:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:36] !log ppchelko@naos Finished deploy [restbase/deploy@fcfc537]: Automatically rerender parsoid, only store summaries if they are changed (duration: 11m 33s) [20:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:51] 06Operations, 10ops-codfw, 10media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756#3219360 (10Cmjohnson) [20:55:04] PROBLEM - DPKG on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:04] PROBLEM - dhclient process on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:04] PROBLEM - Disk space on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:04] PROBLEM - configured eth on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:30] 06Operations, 10ops-eqiad: Physically decommission mw1001-mw1148 (except mw1017 and mw1099) - https://phabricator.wikimedia.org/T141522#3219364 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson [20:55:44] PROBLEM - configured eth on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:44] PROBLEM - Disk space on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:55:54] PROBLEM - Check size of conntrack table on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:54] PROBLEM - Check whether ferm is active by checking the default input chain on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:54] PROBLEM - salt-minion processes on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:54] PROBLEM - Check systemd state on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:54] PROBLEM - dhclient process on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:34] PROBLEM - puppet last run on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:57:17] 06Operations, 10ops-eqiad, 15User-fgiunchedi: HP RAID icinga alert on ms-be1021 - https://phabricator.wikimedia.org/T163777#3209252 (10Cmjohnson) @fgiunchedi Do you need me to do anything with this yet? [20:57:34] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures [20:57:34] RECOVERY - configured eth on gadolinium is OK: OK - interfaces up [20:57:35] RECOVERY - Disk space on gadolinium is OK: DISK OK [20:57:44] RECOVERY - salt-minion processes on meitnerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:57:44] RECOVERY - Check whether ferm is active by checking the default input chain on meitnerium is OK: OK ferm input default policy is set [20:57:44] RECOVERY - Check size of conntrack table on meitnerium is OK: OK: nf_conntrack is 0 % full [20:57:44] RECOVERY - Check systemd state on meitnerium is OK: OK - running: The system is fully operational [20:57:44] RECOVERY - dhclient process on gadolinium is OK: PROCS OK: 0 processes with command name dhclient [20:57:54] RECOVERY - Disk space on meitnerium is OK: DISK OK [20:57:54] RECOVERY - DPKG on meitnerium is OK: All packages OK [20:57:54] RECOVERY - dhclient process on meitnerium is OK: PROCS OK: 0 processes with command name dhclient [20:57:54] RECOVERY - configured eth on meitnerium is OK: OK - interfaces up [21:00:34] PROBLEM - DPKG on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:01:15] !log ppchelko@naos Started deploy [restbase/deploy@61c1ceb]: Automatically rerender parsoid, only store summaries if they are changed, don't rerender data-parsoid [21:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:44] PROBLEM - puppet last run on meitnerium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py],File[/usr/local/bin/puppet-enabled] [21:01:59] 06Operations, 06Labs, 10wikitech.wikimedia.org: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3219386 (10Andrew) [21:02:04] PROBLEM - puppet last run on gadolinium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py],File[/usr/local/bin/puppet-enabled] [21:02:24] PROBLEM - salt-minion processes on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:03:04] PROBLEM - MegaRAID on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:03:24] RECOVERY - salt-minion processes on gadolinium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:04:44] PROBLEM - configured eth on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:04:44] PROBLEM - Disk space on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:04:54] PROBLEM - Check systemd state on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:04:54] PROBLEM - Check size of conntrack table on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:04:54] PROBLEM - Check whether ferm is active by checking the default input chain on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:04:54] PROBLEM - salt-minion processes on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:04:54] PROBLEM - dhclient process on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:04] PROBLEM - Disk space on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:04] PROBLEM - DPKG on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:04] PROBLEM - dhclient process on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:04] PROBLEM - configured eth on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:28] 06Operations, 06Labs, 10wikitech.wikimedia.org: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3219388 (10Andrew) I have three different sets of proposed answers: Option A: 1) Once a month 2) Someone in Ops (or, possibly, me) 3) Icinga should... [21:05:44] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:06:35] RECOVERY - DPKG on gadolinium is OK: All packages OK [21:06:35] RECOVERY - configured eth on gadolinium is OK: OK - interfaces up [21:06:45] RECOVERY - Disk space on gadolinium is OK: DISK OK [21:06:45] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [21:06:45] RECOVERY - salt-minion processes on meitnerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:06:45] RECOVERY - Check whether ferm is active by checking the default input chain on meitnerium is OK: OK ferm input default policy is set [21:06:45] RECOVERY - Check size of conntrack table on meitnerium is OK: OK: nf_conntrack is 0 % full [21:06:45] RECOVERY - Check systemd state on meitnerium is OK: OK - running: The system is fully operational [21:06:45] RECOVERY - dhclient process on gadolinium is OK: PROCS OK: 0 processes with command name dhclient [21:06:54] RECOVERY - Disk space on meitnerium is OK: DISK OK [21:06:54] RECOVERY - dhclient process on meitnerium is OK: PROCS OK: 0 processes with command name dhclient [21:06:54] RECOVERY - DPKG on meitnerium is OK: All packages OK [21:06:54] RECOVERY - configured eth on meitnerium is OK: OK - interfaces up [21:07:54] (03CR) 10Eevans: [V: 032 C: 032] Update collector version to 3.1.4 [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/350503 (https://phabricator.wikimedia.org/T163936) (owner: 10Eevans) [21:08:31] !log ppchelko@naos Finished deploy [restbase/deploy@61c1ceb]: Automatically rerender parsoid, only store summaries if they are changed, don't rerender data-parsoid (duration: 07m 16s) [21:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:44] PROBLEM - DPKG on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:09:44] PROBLEM - configured eth on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:09:44] PROBLEM - Disk space on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:10:24] PROBLEM - salt-minion processes on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:10:35] PROBLEM - SSH on gadolinium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:35] PROBLEM - SSH on meitnerium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:11:14] RECOVERY - salt-minion processes on gadolinium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:11:24] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [21:11:24] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [21:11:36] 06Operations, 10ops-eqiad, 10Phabricator: phab1001 hdd port a failure - https://phabricator.wikimedia.org/T163960#3219408 (10Cmjohnson) A new service request has been placed with Dell. I will update once the order has been executed. Create Service Request: Service Tag 4LF7FB2 Confirmed: Request 947622575... [21:11:54] PROBLEM - Check systemd state on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:11:54] PROBLEM - Check size of conntrack table on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:11:54] PROBLEM - Check whether ferm is active by checking the default input chain on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:11:54] PROBLEM - salt-minion processes on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:11:54] PROBLEM - dhclient process on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:04] PROBLEM - Disk space on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:04] PROBLEM - dhclient process on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:04] PROBLEM - DPKG on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:04] PROBLEM - configured eth on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:31] 06Operations: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3219412 (10Cmjohnson) [21:12:44] RECOVERY - Check size of conntrack table on meitnerium is OK: OK: nf_conntrack is 0 % full [21:12:45] RECOVERY - Check whether ferm is active by checking the default input chain on meitnerium is OK: OK ferm input default policy is set [21:12:45] RECOVERY - salt-minion processes on meitnerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:12:45] RECOVERY - Check systemd state on meitnerium is OK: OK - running: The system is fully operational [21:12:54] RECOVERY - dhclient process on gadolinium is OK: PROCS OK: 0 processes with command name dhclient [21:12:54] RECOVERY - Disk space on meitnerium is OK: DISK OK [21:12:54] RECOVERY - dhclient process on meitnerium is OK: PROCS OK: 0 processes with command name dhclient [21:12:54] RECOVERY - DPKG on meitnerium is OK: All packages OK [21:12:54] RECOVERY - configured eth on meitnerium is OK: OK - interfaces up [21:14:34] RECOVERY - DPKG on gadolinium is OK: All packages OK [21:14:34] RECOVERY - Disk space on gadolinium is OK: DISK OK [21:14:35] RECOVERY - configured eth on gadolinium is OK: OK - interfaces up [21:17:24] PROBLEM - salt-minion processes on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:17:35] PROBLEM - SSH on gadolinium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:17:35] PROBLEM - SSH on meitnerium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:18:14] RECOVERY - salt-minion processes on gadolinium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:18:24] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [21:18:24] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [21:21:34] PROBLEM - SSH on gadolinium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:22:09] jouncebot: now [21:22:09] No deployments scheduled for the next 1 hour(s) and 37 minute(s) [21:22:34] PROBLEM - SSH on meitnerium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:22:44] PROBLEM - DPKG on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:45] PROBLEM - configured eth on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:45] PROBLEM - Disk space on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:54] PROBLEM - Check systemd state on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:54] PROBLEM - Check whether ferm is active by checking the default input chain on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:54] PROBLEM - Check size of conntrack table on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:54] PROBLEM - salt-minion processes on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:22:55] PROBLEM - dhclient process on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:23:14] PROBLEM - NTP on gadolinium is CRITICAL: NTP CRITICAL: No response from NTP server [21:23:24] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [21:25:44] RECOVERY - Disk space on gadolinium is OK: DISK OK [21:25:44] RECOVERY - configured eth on gadolinium is OK: OK - interfaces up [21:25:45] RECOVERY - DPKG on gadolinium is OK: All packages OK [21:25:45] RECOVERY - salt-minion processes on meitnerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:25:45] RECOVERY - Check size of conntrack table on meitnerium is OK: OK: nf_conntrack is 0 % full [21:25:45] RECOVERY - dhclient process on gadolinium is OK: PROCS OK: 0 processes with command name dhclient [21:25:45] RECOVERY - Check whether ferm is active by checking the default input chain on meitnerium is OK: OK ferm input default policy is set [21:25:46] RECOVERY - Check systemd state on meitnerium is OK: OK - running: The system is fully operational [21:27:34] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [21:29:46] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3219432 (10jcrespo) No more scheduled downtime? Can T162681 be closed? [21:30:34] PROBLEM - SSH on gadolinium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:30:44] PROBLEM - SSH on meitnerium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:30:44] PROBLEM - DPKG on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:44] PROBLEM - Disk space on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:44] PROBLEM - configured eth on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:04] RECOVERY - puppet last run on gadolinium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [21:31:43] mutante: what's up with all those alarms? [21:33:54] PROBLEM - Check whether ferm is active by checking the default input chain on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:33:55] PROBLEM - Check size of conntrack table on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:33:55] PROBLEM - dhclient process on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:33:55] PROBLEM - Check systemd state on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:33:55] PROBLEM - salt-minion processes on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:35:26] 06Operations, 10ops-eqiad, 10DBA: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3213695 (10jcrespo) > Nevermind, just remembered we replicate from fqdn and not IPs :) But mediawiki uses IPs. 
[21:36:24] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [21:36:34] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [21:36:35] RECOVERY - Disk space on gadolinium is OK: DISK OK [21:36:35] RECOVERY - configured eth on gadolinium is OK: OK - interfaces up [21:36:35] RECOVERY - DPKG on gadolinium is OK: All packages OK [21:36:35] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [21:36:44] RECOVERY - dhclient process on gadolinium is OK: PROCS OK: 0 processes with command name dhclient [21:36:45] RECOVERY - salt-minion processes on meitnerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:36:45] RECOVERY - Check whether ferm is active by checking the default input chain on meitnerium is OK: OK ferm input default policy is set [21:36:45] RECOVERY - Check systemd state on meitnerium is OK: OK - running: The system is fully operational [21:36:45] RECOVERY - Check size of conntrack table on meitnerium is OK: OK: nf_conntrack is 0 % full [21:38:04] PROBLEM - dhclient process on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:38:05] PROBLEM - DPKG on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:38:05] PROBLEM - configured eth on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:38:05] PROBLEM - Disk space on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:38:14] PROBLEM - puppet last run on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:38:24] PROBLEM - salt-minion processes on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:55] RECOVERY - dhclient process on meitnerium is OK: PROCS OK: 0 processes with command name dhclient [21:39:55] RECOVERY - Disk space on meitnerium is OK: DISK OK [21:39:55] RECOVERY - configured eth on meitnerium is OK: OK - interfaces up [21:39:55] RECOVERY - DPKG on meitnerium is OK: All packages OK [21:40:04] RECOVERY - puppet last run on gadolinium is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures [21:40:14] RECOVERY - salt-minion processes on gadolinium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:40:24] RECOVERY - Check the NTP synchronisation status of timesyncd on meitnerium is OK: OK: synced at Thu 2017-04-27 21:40:20 UTC. [21:41:54] PROBLEM - salt-minion processes on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:54] PROBLEM - dhclient process on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:54] PROBLEM - Check size of conntrack table on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:54] PROBLEM - Check systemd state on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:55] PROBLEM - Check whether ferm is active by checking the default input chain on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:42:44] RECOVERY - salt-minion processes on meitnerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:42:44] RECOVERY - dhclient process on gadolinium is OK: PROCS OK: 0 processes with command name dhclient [21:42:44] RECOVERY - Check systemd state on meitnerium is OK: OK - running: The system is fully operational [21:42:44] RECOVERY - Check size of conntrack table on meitnerium is OK: OK: nf_conntrack is 0 % full [21:42:45] RECOVERY - Check whether ferm is active by checking the default input chain on meitnerium is OK: OK ferm input default policy is set [21:43:14] PROBLEM - configured eth on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:14] PROBLEM - dhclient process on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:14] PROBLEM - Disk space on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:14] PROBLEM - DPKG on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:44] PROBLEM - SSH on meitnerium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:44] PROBLEM - configured eth on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:44] PROBLEM - puppet last run on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:44] PROBLEM - DPKG on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:44] PROBLEM - Disk space on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:54] PROBLEM - dhclient process on gadolinium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:55] PROBLEM - Check size of conntrack table on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:55] PROBLEM - Check whether ferm is active by checking the default input chain on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:55] PROBLEM - Check systemd state on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:55] PROBLEM - salt-minion processes on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:46:30] (03PS2) 10Mobrovac: Decom legacy citoid service hostname [puppet] - 10https://gerrit.wikimedia.org/r/350505 (https://phabricator.wikimedia.org/T133001) (owner: 10Jforrester) [21:46:45] RECOVERY - Check size of conntrack table on meitnerium is OK: OK: nf_conntrack is 0 % full [21:46:45] RECOVERY - Check whether ferm is active by checking the default input chain on meitnerium is OK: OK ferm input default policy is set [21:46:45] RECOVERY - dhclient process on gadolinium is OK: PROCS OK: 0 processes with command name dhclient [21:46:45] RECOVERY - salt-minion processes on meitnerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:46:45] RECOVERY - Check systemd state on meitnerium is OK: OK - running: The system is fully operational [21:47:04] RECOVERY - Disk space on meitnerium is OK: DISK OK [21:47:04] RECOVERY - configured eth on meitnerium is OK: OK - interfaces up [21:47:04] RECOVERY - dhclient process on meitnerium is OK: PROCS OK: 0 processes with command name dhclient [21:47:04] RECOVERY - DPKG on meitnerium is OK: All packages OK [21:48:44] PROBLEM - SSH on gadolinium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:48:50] * volans looking [21:50:34] (03CR) 10Mobrovac: [C: 04-1] "Should go out on May 1st" [puppet] - 10https://gerrit.wikimedia.org/r/350505 (https://phabricator.wikimedia.org/T133001) (owner: 10Jforrester) [21:51:34] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [21:51:34] RECOVERY - SSH on meitnerium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [21:51:34] RECOVERY - configured eth on gadolinium is OK: OK - interfaces up [21:51:34] RECOVERY - Disk space on gadolinium is OK: DISK OK [21:51:34] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures [21:51:35] RECOVERY - DPKG on gadolinium is OK: All packages OK [21:54:21] (03PS1) 10Mobrovac: Remove the citoid.wm.org DNS record [dns] - 10https://gerrit.wikimedia.org/r/350748 (https://phabricator.wikimedia.org/T133001) [21:56:17] !log shutting down gadolinium, it came up 1h25m ago and stole the public IP from meitnerium [21:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:15] 06Operations, 06Labs: Update documentation for Tools Proxy failover - https://phabricator.wikimedia.org/T163390#3219480 (10chasemp) 05Open>03Resolved thanks, calling it good for now [21:59:48] (03CR) 10Eevans: [C: 031] Link-in upgraded cassandra-metrics-collector jar [puppet] - 10https://gerrit.wikimedia.org/r/350632 (https://phabricator.wikimedia.org/T163936) (owner: 10Eevans) [22:01:37] (03PS1) 10Reedy: Run l10nupdate monday to thursday [puppet] - 10https://gerrit.wikimedia.org/r/350749 (https://phabricator.wikimedia.org/T164035) [22:01:45] (03PS2) 10Eevans: Link-in upgraded cassandra-metrics-collector jar [puppet] - 10https://gerrit.wikimedia.org/r/350632 (https://phabricator.wikimedia.org/T163936) [22:02:15] 06Operations, 10ops-eqiad, 10hardware-requests: decommission gadolinium - https://phabricator.wikimedia.org/T164036#3219495 (10RobH) [22:02:25] (03CR) 10Eevans: [C: 031] "This is now ready to go." 
[puppet] - 10https://gerrit.wikimedia.org/r/350632 (https://phabricator.wikimedia.org/T163936) (owner: 10Eevans) [22:04:29] 06Operations, 10ops-eqiad, 10hardware-requests: decommission gadolinium - https://phabricator.wikimedia.org/T164036#3219514 (10RobH) [22:10:57] 06Operations, 10ops-eqiad, 10hardware-requests: decommission gadolinium - https://phabricator.wikimedia.org/T164036#3219540 (10RobH) a:05RobH>03Cmjohnson [22:11:28] 06Operations, 10ops-eqiad, 10hardware-requests: decommission gadolinium - https://phabricator.wikimedia.org/T164036#3219495 (10RobH) Ok, did all the initial steps, and now it needs disks wiped, then remaining decom steps. [22:11:49] (03CR) 10Greg Grossmeier: [C: 031] "Yup, I had the same patch ready but you were faster. :)" [puppet] - 10https://gerrit.wikimedia.org/r/350749 (https://phabricator.wikimedia.org/T164035) (owner: 10Reedy) [22:15:13] (03PS1) 10Faidon Liambotis: Remove the ipv6::relay/miredo manifests [puppet] - 10https://gerrit.wikimedia.org/r/350754 [22:26:30] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3219587 (10Volans) Please ensure also that remote IPMI is working, eventually applying the fix in T150160, because right now is not: ``` neodymium 0 ~$ ipmitool -I lanplus -H ocg1001.mgmt.eqiad.wmnet -U root -E ch... [22:30:04] (03PS1) 10Thcipriani: Scap: update version to 3.5.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/350757 (https://phabricator.wikimedia.org/T127762) [22:33:19] who fixed ocg1001 ?:) [22:33:31] when i left it was still having a problem, now it doesn't [22:34:54] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3219637 (10Dzahn) it has been reinstalled and re-added to puppet and salt but i saw errors about failed deployment of the ocg app, so i didn't repool it yet... then after being afk for a while i came back and look a... [22:34:55] I didn't touch it [22:35:53] so i reinstalled it and ran puppet and it showed errors when deploying the ocg app. something about "has the minion key been signed?" [22:36:09] then i just went away for a while and came back and now ocg is running [22:36:13] and puppet runs without issues :p [22:36:13] (03CR) 10Jforrester: [C: 04-1] "Not before 1 May." [dns] - 10https://gerrit.wikimedia.org/r/350748 (https://phabricator.wikimedia.org/T133001) (owner: 10Mobrovac) [22:36:31] yes, i had signed the salt key [22:36:36] remote ipmi doesn't work though [22:36:41] yea, i saw that [22:36:55] so i was about to comment "it's reinstalled but not pooled yet, so you can take it down again to do that" [22:37:00] and then i see it works [22:38:14] mutante shows [22:38:15] [21:52:14] <+icinga-wm> RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 72745 msg: ocg_render_job_queue 0 msg [22:39:20] note that the time is bst and not utc so it's +1 from utc. [22:39:35] ok, the issue was "has the minion key been accepted", i assume something needed a little time to realize the old salt key was revoked and the new one added [22:41:54] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3219650 (10Dzahn) looks like it's ok. <+icinga-wm> RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 72745 msg: ocg_render_job_queue 0 msg ``` @ocg1001:~# ps aux | grep ocg root 3667 0.0 0.0 1187...
[22:42:16] (03PS1) 10RobH: install params for mwreleases1001 [puppet] - 10https://gerrit.wikimedia.org/r/350760 [22:42:35] leaves it de-pooled so that the IPMI thing can be done first [22:42:40] 06Operations, 06Release-Engineering-Team, 13Patch-For-Review, 07Security-General: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030#3219652 (10RobH) [22:44:13] (03CR) 10RobH: [C: 032] install params for mwreleases1001 [puppet] - 10https://gerrit.wikimedia.org/r/350760 (owner: 10RobH) [22:45:01] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3219659 (10Dzahn) @cmjohnson it's still ok to take it down, it's not getting traffic. i'll wait for the IPMI issue before putting it back to work. [22:49:49] volans: https://phabricator.wikimedia.org/T155692#3219278 [22:50:20] mutante: that's local, not remote [22:51:07] ah... [22:54:28] 06Operations: Racktables: clearly show when hosts are decommissioned - https://phabricator.wikimedia.org/T164042#3219696 (10Volans) [22:54:37] 06Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#3219709 (10Dzahn) [22:54:39] 06Operations, 13Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3219708 (10Dzahn) [22:57:55] jouncebot: refresh [22:57:58] I refreshed my knowledge about deployments. [22:58:17] 06Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#3219712 (10Dzahn) This sounds similar to the other tickets linked to "tracking" task T162850. We have observed downthrottling to 200MHz on other servers before. The interesting part is that they were all `R32... [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170427T2300). Please do the needful. [23:00:04] RoanKattouw, ebernhardson, Krinkle, and Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:11] \o [23:01:22] I can do the SWAT today [23:01:58] o/ [23:02:12] Made it just in time, incl. Ayola sandwich! [23:02:22] (03CR) 10Catrope: [C: 032] enable sistersearch title filter profile for wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350637 (owner: 10EBernhardson) [23:02:23] RoanKattouw: Just got home a minute ago :D [23:02:30] Good timing [23:02:33] Oh nice re Ayola [23:02:49] There's also a nice Greek place in the galleria attached to the building where the new office will be [23:03:07] I'm sure that was a requirement for the location. [23:03:27] Although Ayola is still pretty close :) [23:04:04] there is also an Oasis Grill at market and 3rd, which imo is better than Ayola ;) [23:05:21] Oh yeah I've never tried that one [23:05:29] There's also another Ayola on Kearny nearish the new office [23:05:36] And a Working Girls at Kearny & Bus [23:05:36] h [23:05:41] how did swat turn into a chat about restaurants lol [23:05:52] Krinkle was hungry :P [23:05:53] 16:02:11  Made it just in time, incl. Ayola sandwich! [23:06:13] 06Operations, 06Release-Engineering-Team, 13Patch-For-Review, 07Security-General: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030#3219721 (10RobH) [23:06:19] poor Krinkle here have 5 dollars go get a subway [23:06:30] Yeah, they're not bad. I usually go to the one on Drum St.
They're not Greek though. It's a differnet way of making, nice, but I prefer Greek :) [23:06:37] Hehe. [23:06:50] (03Merged) 10jenkins-bot: enable sistersearch title filter profile for wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350637 (owner: 10EBernhardson) [23:06:59] (03CR) 10jenkins-bot: enable sistersearch title filter profile for wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350637 (owner: 10EBernhardson) [23:07:16] It's one of the things I like about visiting SF, having lunch :P [23:07:33] i never been to cali [23:07:35] There are a bunch of them around, there's also one at 4th & Howard [23:08:18] ebernhardson: Your change is on mw2017, please test [23:08:22] looking [23:08:28] RoanKattouw: willing to do a last minute SWAT change for SecurePoll, just ran into a bug with the form create that I'd really like to roll out the fix now (already done about to commit) so that I don't have to do a manual creation. [23:08:31] * paladox had lunch 12 hours ago [23:08:48] Jamesofur: Sure, add to the wiki page and ping me when done, then I'll refresh and add it in [23:09:02] o7 [23:09:11] Even here people had lunch 4 hours ago [23:09:14] Bingo ... :P [23:09:19] Not sure why Krinkle got lunch at 3:30pm :P [23:09:26] RoanKattouw: better late then never [23:09:41] Yeah, I guess there have been days that I had lunch at 3... [23:09:54] RoanKattouw: works as expected [23:09:54] i once ate lunch at 5:30 [23:09:56] Dereckson: You around for your SWAT patches? [23:10:12] (03CR) 10Catrope: [C: 032] Set ORES thresholds in new format for all enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349146 (https://phabricator.wikimedia.org/T162760) (owner: 10Catrope) [23:10:48] 06Operations, 06Release-Engineering-Team, 13Patch-For-Review, 07Security-General: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030#3219726 (10RobH) a:05RobH>03demon Ok, this isn't in site.pp yet, since I'm not sure which roles you want to assign. It is ready for addition th... 
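For context on the "your change is on mw2017, please test" step above: during SWAT the deployer stages each merged change on a debug-only app server before syncing it fleet-wide, and the patch owner verifies it there. A rough sketch of that verification, assuming the 2017-era scap tooling (whether mw2017 honoured the debug header at the time is an assumption; the header format is the one documented on Wikitech):

```
# On the debug app server: pull the staged code from the deployment master.
ssh mw2017.codfw.wmnet 'scap pull'

# The tester then routes a request through that specific backend using the
# X-Wikimedia-Debug header and eyeballs the new behaviour:
curl -s -H 'X-Wikimedia-Debug: backend=mw2017.codfw.wmnet' \
     'https://el.wikipedia.org/wiki/Special:Version' | head -n 20
```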
[23:11:00] 06Operations, 06Release-Engineering-Team, 07Security-General: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030#3219729 (10RobH)
[23:13:23] !log catrope@naos Synchronized wmf-config/CirrusSearch-common.php: Enable sistersearch title profile for wikivoyage (duration: 01m 19s)
[23:13:29] Krinkle: Your change is ready for testing on mw2017
[23:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:52] (03PS4) 10Catrope: Set ORES thresholds in new format for all enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349146 (https://phabricator.wikimedia.org/T162760)
[23:13:53] RoanKattouw: k, checking
[23:13:58] (03CR) 10Catrope: [C: 032] Set ORES thresholds in new format for all enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349146 (https://phabricator.wikimedia.org/T162760) (owner: 10Catrope)
[23:14:27] (03PS7) 10Faidon Liambotis: Replace $::main_ipaddress by the new ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196)
[23:14:29] (03PS3) 10Faidon Liambotis: Switch add_ip6_mapped to use interface_primary [puppet] - 10https://gerrit.wikimedia.org/r/345568 (https://phabricator.wikimedia.org/T163196)
[23:14:32] (03PS1) 10Faidon Liambotis: lvs: replace $::ipaddress_eth0 by $::ipaddress [puppet] - 10https://gerrit.wikimedia.org/r/350765 (https://phabricator.wikimedia.org/T163196)
[23:14:33] (03PS1) 10Faidon Liambotis: dnsrecursor: use ipaddress6, not ipaddress6_eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350766 (https://phabricator.wikimedia.org/T163196)
[23:14:35] (03PS1) 10Faidon Liambotis: labs: remove the _eth0 suffix from ipaddress facts [puppet] - 10https://gerrit.wikimedia.org/r/350767 (https://phabricator.wikimedia.org/T163196)
[23:14:38] (03PS1) 10Faidon Liambotis: Remove c/p interface argument to add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/350768 (https://phabricator.wikimedia.org/T163196)
[23:14:40] (03PS1) 10Faidon Liambotis: lvs: remove support for <= trusty [puppet] - 10https://gerrit.wikimedia.org/r/350769
[23:14:42] (03PS1) 10Faidon Liambotis: interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196)
[23:14:44] (03PS1) 10Faidon Liambotis: cache: use interface_primary instead of eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196)
[23:14:56] (03Merged) 10jenkins-bot: Set ORES thresholds in new format for all enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349146 (https://phabricator.wikimedia.org/T162760) (owner: 10Catrope)
[23:15:03] (03PS1) 10Faidon Liambotis: Add a new interface::alias definition [puppet] - 10https://gerrit.wikimedia.org/r/350773
[23:15:05] (03PS1) 10Faidon Liambotis: labs::dnsrecursor: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350774
[23:15:07] (03PS1) 10Faidon Liambotis: cassandra: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350775
[23:15:09] (03PS1) 10Faidon Liambotis: gerrit: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350776
[23:15:11] (03PS1) 10Faidon Liambotis: phabricator: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350777
[23:15:13] (03PS1) 10Faidon Liambotis: lists: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350778
[23:15:19] RoanKattouw: Good to go. For the most part this is job queue, though, so I can't fully exercise it, but it's been confirmed locally and in beta.
[23:16:07] (03CR) 10Paladox: phabricator: switch to interface::alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350777 (owner: 10Faidon Liambotis)
[23:16:12] (03CR) 10jerkins-bot: [V: 04-1] dnsrecursor: use ipaddress6, not ipaddress6_eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350766 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis)
[23:16:58] !log catrope@naos Synchronized php-1.29.0-wmf.21/includes/deferred/LinksUpdate.php: Release prior row locks beforehand in LinksUpdate::updateCategoryCounts (T163801) (duration: 01m 01s)
[23:17:00] RoanKattouw: yes, I'm here
[23:17:04] paladox: did you even look at the manifest?
[23:17:06] (03CR) 10jerkins-bot: [V: 04-1] interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis)
[23:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:07] T163801: WikiPage::updateCategoryCounts caused 14 minutes of lag on enwiki - https://phabricator.wikimedia.org/T163801
[23:18:10] (03PS2) 10Faidon Liambotis: dnsrecursor: use ipaddress6, not ipaddress6_eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350766 (https://phabricator.wikimedia.org/T163196)
[23:18:12] (03PS2) 10Faidon Liambotis: labs: remove the _eth0 suffix from ipaddress facts [puppet] - 10https://gerrit.wikimedia.org/r/350767 (https://phabricator.wikimedia.org/T163196)
[23:18:16] (03PS4) 10Faidon Liambotis: Switch add_ip6_mapped to use interface_primary [puppet] - 10https://gerrit.wikimedia.org/r/345568 (https://phabricator.wikimedia.org/T163196)
[23:18:16] (03PS2) 10Faidon Liambotis: Remove c/p interface argument to add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/350768 (https://phabricator.wikimedia.org/T163196)
[23:18:18] (03PS2) 10Faidon Liambotis: lvs: remove support for <= trusty [puppet] - 10https://gerrit.wikimedia.org/r/350769
[23:18:20] (03PS2) 10Faidon Liambotis: interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196)
[23:18:22] (03PS2) 10Faidon Liambotis: cache: use interface_primary instead of eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196)
[23:19:34] (03PS3) 10Faidon Liambotis: interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196)
[23:19:35] (03PS3) 10Faidon Liambotis: cache: use interface_primary instead of eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196)
[23:20:48] (03CR) 10jenkins-bot: Set ORES thresholds in new format for all enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349146 (https://phabricator.wikimedia.org/T162760) (owner: 10Catrope)
[23:21:00] (03CR) 10jerkins-bot: [V: 04-1] interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis)
[23:22:47] !log catrope@naos Synchronized wmf-config/InitialiseSettings.php: Set ORES thresholds in new format for all enabled wikis (T162760) (duration: 00m 53s)
[23:22:47] RoanKattouw: Added, https://gerrit.wikimedia.org/r/#/c/350779/ (master) and https://gerrit.wikimedia.org/r/#/c/350780/ (cherry-pick so that voteWiki gets it)
[23:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:55] T162760: Make presence and targets of ORES filters configurable - https://phabricator.wikimedia.org/T162760
[23:22:56] (03PS2) 10Catrope: Enable responsive references on el.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348400 (https://phabricator.wikimedia.org/T163074) (owner: 10Dereckson)
[23:23:01] (03CR) 10Catrope: [C: 032] Enable responsive references on el.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348400 (https://phabricator.wikimedia.org/T163074) (owner: 10Dereckson)
[23:23:59] paravoid: nope
[23:24:06] (03Merged) 10jenkins-bot: Enable responsive references on el.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348400 (https://phabricator.wikimedia.org/T163074) (owner: 10Dereckson)
[23:24:14] ah, never mind
[23:24:19] sorry, just looked now
[23:24:31] can you look at the manifest before leaving a comment next time?
[23:25:17] yep, sorry, didn't realise there was a manifest
[23:25:41] (03CR) 10jenkins-bot: Enable responsive references on el.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348400 (https://phabricator.wikimedia.org/T163074) (owner: 10Dereckson)
[23:26:09] (03PS2) 10Catrope: Enable WikidataPageBanner on vi.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350085 (https://phabricator.wikimedia.org/T163662) (owner: 10Dereckson)
[23:26:14] (03CR) 10Catrope: [C: 032] Enable WikidataPageBanner on vi.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350085 (https://phabricator.wikimedia.org/T163662) (owner: 10Dereckson)
[23:26:41] Dereckson: Your responsive references change is on mw2017, please test
[23:26:46] testing
[23:27:16] (03Merged) 10jenkins-bot: Enable WikidataPageBanner on vi.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350085 (https://phabricator.wikimedia.org/T163662) (owner: 10Dereckson)
[23:27:35] (03CR) 10jenkins-bot: Enable WikidataPageBanner on vi.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350085 (https://phabricator.wikimedia.org/T163662) (owner: 10Dereckson)
[23:28:03] RoanKattouw: seems to work
[23:29:14] !log catrope@naos Synchronized wmf-config/InitialiseSettings.php: Enable responsive references on elwiki (T163074) (duration: 00m 49s)
[23:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:22] T163074: Make Reference Lists 'responsive' by default on elwiki - https://phabricator.wikimedia.org/T163074
[23:29:31] Dereckson: WikidataPageBanner is now on mw2017
[23:30:01] ok
[23:32:02] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range
[23:32:30] RoanKattouw: works
[23:32:50] Jamesofur: Is your change something you can test on mw2017, or does it only arise when doing something big like creating an election?
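The "!log ... Synchronized ... (duration: ...)" entries above are what scap prints, and jouncebot relays to the Server Admin Log, when the deployer pushes a single file to the whole fleet after it has tested clean on mw2017. A minimal sketch, assuming the scap 3.x CLI of the time (the file path and log message are taken from the log itself):

```
# On the deploy host (naos during this codfw-based run): sync one file to
# all app servers; scap records the sync and its duration in the admin log.
scap sync-file wmf-config/InitialiseSettings.php \
    'Enable responsive references on elwiki (T163074)'
```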
[23:33:02] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[23:33:26] RoanKattouw: yeah, it only arises when creating an election, since it's the form, sadly
[23:33:34] OK, I'll just push it out then
[23:33:37] appreciate it
[23:34:01] !log catrope@naos Synchronized wmf-config/InitialiseSettings.php: Enable WikidataPageBanner on viwikivoyage (T163662) (duration: 00m 46s)
[23:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:10] T163662: Enable the use of PAGEBANNER magic word on Vietnamese Wikivoyage - https://phabricator.wikimedia.org/T163662
[23:34:14] Thanks for the deploy
[23:34:30] RainbowSprinkles: Hmm, does the scap canary check work when running from codfw, given that all the boxes it checks are in eqiad?
[23:36:01] !log catrope@naos Synchronized php-1.29.0-wmf.21/extensions/SecurePoll/includes/pages/CreatePage.php: Stopgap fix for global election creation (T164043) (duration: 00m 43s)
[23:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:11] T164043: SecurePoll requiring property_wiki for global elections - https://phabricator.wikimedia.org/T164043
[23:36:27] Alright, SWAT done
[23:36:29] Thanks everybody
[23:36:32] o7 thanks
[23:42:32] RoanKattouw: Yeah, I noticed that. Needs fixing
[23:42:46] Easy hiera change
[23:44:19] Long term we should make it vary by DC
[23:54:52] (03PS4) 10Faidon Liambotis: interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196)
[23:54:54] (03PS4) 10Faidon Liambotis: cache: use interface_primary instead of eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196)
[23:59:15] What? We've been deploying without canary checks?
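On the closing exchange: scap's canary step syncs a change to a small set of canary app servers and checks them for elevated error rates before the full fleet sync, so a canary list hard-coded to eqiad hosts is never actually exercised when deploying from codfw, which is RoanKattouw's point. A hedged way to confirm what scap considers its canaries from the deploy host; both file paths below are assumptions about how the hiera-managed list is materialized, not something the log confirms:

```
# Hypothetical paths; the authoritative list lives in puppet/hiera per the
# "Easy hiera change" remark above:
cat /etc/dsh/group/mediawiki-canaries   # if every entry is an eqiad host, the
                                        # canary check is vacuous from codfw
grep -i canary /etc/scap.cfg            # scap's own view of its canary targets
```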