[00:15:21] (03PS1) 1020after4: Add a mysql slave port parameter and provide it's value via hiera [puppet] - 10https://gerrit.wikimedia.org/r/438128 (https://phabricator.wikimedia.org/T196604) [00:16:08] (03CR) 10jerkins-bot: [V: 04-1] Add a mysql slave port parameter and provide it's value via hiera [puppet] - 10https://gerrit.wikimedia.org/r/438128 (https://phabricator.wikimedia.org/T196604) (owner: 1020after4) [00:22:44] (03CR) 1020after4: [C: 031] Add a mysql slave port parameter and provide it's value via hiera [puppet] - 10https://gerrit.wikimedia.org/r/438128 (https://phabricator.wikimedia.org/T196604) (owner: 1020after4) [00:24:17] (03PS2) 1020after4: Add a mysql slave port parameter and provide it's value via hiera [puppet] - 10https://gerrit.wikimedia.org/r/438128 (https://phabricator.wikimedia.org/T196604) [00:25:41] (03CR) 1020after4: [C: 031] Add a mysql slave port parameter and provide it's value via hiera [puppet] - 10https://gerrit.wikimedia.org/r/438128 (https://phabricator.wikimedia.org/T196604) (owner: 1020after4) [00:38:56] 10Operations, 10ops-codfw, 10Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4266331 (10Papaul) [00:48:49] 10Operations, 10ops-codfw, 10Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4266350 (10Papaul) [01:45:54] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 404 (expecting: 200) [01:47:04] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [02:11:37] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@0346959]: Update mobileapps to 5ea008c [02:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:19] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@0346959]: Update mobileapps to 5ea008c (duration: 00m 42s) [02:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:15] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 820.48 seconds [04:22:59] (03PS1) 10KartikMistry: WIP: Update apertium-apy init scripts [puppet] - 10https://gerrit.wikimedia.org/r/438135 (https://phabricator.wikimedia.org/T194342) [04:49:24] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [04:52:45] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[05:12:49] 10Operations, 10ops-codfw, 10DBA: replace bad disk in db2059 - https://phabricator.wikimedia.org/T196709#4266499 (10Marostegui) @Papaul feel free to replace the disk as soon as you get it [05:16:30] (03PS1) 10Marostegui: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438136 (https://phabricator.wikimedia.org/T191316) [05:18:51] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438136 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:20:43] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438136 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:21:00] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438136 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:22:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1113:3316 for alter table (duration: 00m 52s) [05:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:17] !log Deploy schema change on db1113:3316 - T191316 T192926 T195193 T89737 [05:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:24] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [05:22:25] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [05:22:25] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [05:22:25] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [05:23:01] !log Stop MySQL on db1091 for reimage [05:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:28] !log Deploy sanitarium events on db1124 - T190704 [05:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:32] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [05:37:06] (03PS2) 10Marostegui: mariadb: Repool db1091 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437999 (owner: 10Jcrespo) [05:44:12] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4266537 (10Marostegui) pc2005 caught up - I will wait until Monday to make sure it is estable before repooling it and closing this ticket [05:48:49] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [05:51:58] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
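The db1113:3316 work above (05:16-05:22) follows the usual depool-for-maintenance cycle: take the replica out of rotation in mediawiki-config, sync the file, run the schema change, then repool. A rough sketch of the deployer side, assuming the edit is made in the staging checkout on the deploy host and synced with scap's sync-file as in the !log entries (paths and commit messages here are illustrative, not the exact commands that were run):

    # on deploy1001, in the mediawiki-config staging checkout (typically /srv/mediawiki-staging)
    $EDITOR wmf-config/db-eqiad.php        # zero out / comment the weight of db1113:3316 in its section
    scap sync-file wmf-config/db-eqiad.php 'Depool db1113:3316 for alter table'
    # ...run the schema change against the depooled instance...
    scap sync-file wmf-config/db-eqiad.php 'Repool db1113:3316 after alter table'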
[05:56:50] (03CR) 10Marostegui: [C: 032] "db1091 has been reimaged + installed intel-microcodes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437999 (owner: 10Jcrespo) [05:58:28] (03Merged) 10jenkins-bot: mariadb: Repool db1091 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437999 (owner: 10Jcrespo) [05:58:40] (03CR) 10jenkins-bot: mariadb: Repool db1091 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437999 (owner: 10Jcrespo) [06:01:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 with low weight after reimage (duration: 00m 50s) [06:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:50] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4266555 (10Dzahn) Yes, create a new one, apply puppet role and run it would get us back to the state it was in. Tha... [06:06:37] (03PS4) 10Dzahn: Cleanup after migration of deployment servers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/436284 (owner: 10Muehlenhoff) [06:09:02] (03CR) 10Dzahn: [C: 032] Cleanup after migration of deployment servers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/436284 (owner: 10Muehlenhoff) [06:17:21] PROBLEM - Check whether ferm is active by checking the default input chain on db1115 is CRITICAL: Return code of 255 is out of bounds [06:17:22] PROBLEM - Disk space on db1115 is CRITICAL: Return code of 255 is out of bounds [06:17:33] PROBLEM - mysqld processes on db1115 is CRITICAL: Return code of 255 is out of bounds [06:17:34] PROBLEM - DPKG on db1115 is CRITICAL: Return code of 255 is out of bounds [06:17:35] PROBLEM - Check systemd state on db1115 is CRITICAL: Return code of 255 is out of bounds [06:17:52] PROBLEM - Check size of conntrack table on db1115 is CRITICAL: Return code of 255 is out of bounds [06:18:11] PROBLEM - configured eth on db1115 is CRITICAL: Return code of 255 is out of bounds [06:18:14] PROBLEM - MariaDB disk space on db1115 is CRITICAL: Return code of 255 is out of bounds [06:18:14] PROBLEM - dhclient process on db1115 is CRITICAL: Return code of 255 is out of bounds [06:18:21] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [06:18:31] PROBLEM - MD RAID on db1115 is CRITICAL: Return code of 255 is out of bounds [06:19:42] marostegui: is that you? ^^ [06:21:21] PROBLEM - puppet last run on db1115 is CRITICAL: Return code of 255 is out of bounds [06:21:51] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
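The burst of "Return code of 255" criticals for db1115 above is the signature of the NRPE daemon itself being unreachable rather than the individual checks failing; per the conversation a few minutes later it was caused by nagios-nrpe-server failing to fork and was cleared with a manual service restart. A hedged sketch of that kind of intervention (unit name taken from the syslog line quoted below; these are the standard systemd commands, not a transcript of what was actually typed):

    # on db1115
    journalctl -u nagios-nrpe-server.service --since '1 hour ago'   # look for the fork / out-of-memory errors
    sudo systemctl restart nagios-nrpe-server.service
    systemctl status nagios-nrpe-server.service                     # Icinga checks recover on the next poll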
[06:22:08] (03PS1) 10Elukey: profile::analytics::refinery::job::sqoop_mw: avoid root cronspam [puppet] - 10https://gerrit.wikimedia.org/r/438140 (https://phabricator.wikimedia.org/T132324) [06:22:56] nope [06:22:57] I am checking [06:23:07] it is storage related [06:23:16] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::sqoop_mw: avoid root cronspam [puppet] - 10https://gerrit.wikimedia.org/r/438140 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [06:24:55] (03CR) 10Gilles: [C: 032] Remove now-optional performance survey description [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437981 (https://phabricator.wikimedia.org/T196630) (owner: 10Gilles) [06:25:21] RECOVERY - Disk space on db1115 is OK: DISK OK [06:25:24] RECOVERY - mysqld processes on db1115 is OK: PROCS OK: 1 process with command name mysqld [06:25:24] RECOVERY - DPKG on db1115 is OK: All packages OK [06:25:28] Jun 8 06:15:03 db1115 systemd[1]: nagios-nrpe-server.service: Failed to fork: Cannot allocate memory [06:25:31] RECOVERY - Check systemd state on db1115 is OK: OK - running: The system is fully operational [06:25:40] yep [06:25:49] I was distracted with some old dmesg entries [06:25:50] I am sorry, I was on there and you hadn't arrived, service restarted [06:25:51] RECOVERY - Check size of conntrack table on db1115 is OK: OK: nf_conntrack is 0 % full [06:26:01] RECOVERY - configured eth on db1115 is OK: OK - interfaces up [06:26:04] RECOVERY - MariaDB disk space on db1115 is OK: DISK OK [06:26:04] RECOVERY - dhclient process on db1115 is OK: PROCS OK: 0 processes with command name dhclient [06:26:21] RECOVERY - MD RAID on db1115 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [06:26:21] RECOVERY - Check whether ferm is active by checking the default input chain on db1115 is OK: OK ferm input default policy is set [06:26:31] RECOVERY - puppet last run on db1115 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:26:34] apergos: where did you get that from? [06:26:40] the event scheduler sure does clutter up the logs [06:26:43] syslog [06:28:04] getting back off the box [06:28:57] I will create a ticket to follow up [06:30:11] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh] [06:30:58] (03PS3) 10Dzahn: Add a mysql slave port parameter and provide it's value via hiera [puppet] - 10https://gerrit.wikimedia.org/r/438128 (https://phabricator.wikimedia.org/T196604) (owner: 1020after4) [06:31:22] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:34:24] (03PS2) 10Gilles: Remove now-optional performance survey description [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437981 (https://phabricator.wikimedia.org/T196630) [06:34:48] (03CR) 10Gilles: [C: 032] Remove now-optional performance survey description [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437981 (https://phabricator.wikimedia.org/T196630) (owner: 10Gilles) [06:34:52] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/11418/phab1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/438128 (https://phabricator.wikimedia.org/T196604) (owner: 1020after4) [06:35:26] (03PS4) 10Dzahn: phabricator: Add mysql slave port parameter and provide value via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/438128 (https://phabricator.wikimedia.org/T196604) (owner: 1020after4) [06:36:05] (03Merged) 10jenkins-bot: Remove now-optional performance survey description [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437981 (https://phabricator.wikimedia.org/T196630) (owner: 10Gilles) [06:38:03] 10Operations, 10Maps-Sprint, 10Maps (Tilerator): Externalize tile storage for maps - https://phabricator.wikimedia.org/T196474#4266574 (10Pnorman) > Any of those solution is likely to require some changes to tilerator / kartotherian, so it is likely to require some development time (and it is unclear if / wh... [06:38:22] (03CR) 10jenkins-bot: Remove now-optional performance survey description [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437981 (https://phabricator.wikimedia.org/T196630) (owner: 10Gilles) [06:40:02] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.19 seconds [06:44:17] (03CR) 10Dzahn: [C: 031] "i see puppet is disabled on prod hosts and upgrade is ongoing on ticket -> https://phabricator.wikimedia.org/T178905" [puppet] - 10https://gerrit.wikimedia.org/r/438035 (https://phabricator.wikimedia.org/T178905) (owner: 10Eevans) [06:44:45] !log bounce kafka mirror maker main-eqiad-to-main-codfw (kafka200*) due to errors in the logs (also lag metrics not displaying) [06:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:49] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T196630 Remove unneeded description from performance survey definition (duration: 00m 52s) [06:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:53] T196630: Remove unneeded description from performance survey definition - https://phabricator.wikimedia.org/T196630 [06:51:41] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={container_status,create_container,image_status,podsandbox_status,remove_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:52:22] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 292.45 seconds [06:52:51] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:55:31] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:56:51] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:57:20] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438157 [06:58:55] (03PS1) 10Elukey: burrow: fix typo in logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/438159 [06:59:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438157 (owner: 10Marostegui) [06:59:52] (03CR) 10Elukey: [C: 032] burrow: fix typo in logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/438159 (owner: 10Elukey) [07:00:24] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438157 (owner: 10Marostegui) [07:00:36] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438157 (owner: 10Marostegui) [07:01:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increse weight for db1091 (duration: 00m 50s) [07:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:21] (03PS1) 10Muehlenhoff: Move php5 packages to contint class [puppet] - 10https://gerrit.wikimedia.org/r/438164 [07:16:18] (03PS1) 10Muehlenhoff: Remove obsolete compat code for PHP 5 [puppet] - 10https://gerrit.wikimedia.org/r/438167 [07:18:32] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438168 [07:22:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438168 (owner: 10Marostegui) [07:23:57] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438168 (owner: 10Marostegui) [07:25:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase weight for db1091 (duration: 00m 50s) [07:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:02] (03PS2) 10Dzahn: cassandra: upgrade 3.x version to 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/438035 (https://phabricator.wikimedia.org/T178905) (owner: 10Eevans) [07:27:02] (03CR) 10Dzahn: [C: 032] cassandra: upgrade 3.x version to 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/438035 (https://phabricator.wikimedia.org/T178905) (owner: 10Eevans) [07:27:20] (03CR) 10Dzahn: [C: 032] "puppet is disabled on restbase hosts for this ongoing upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/438035 (https://phabricator.wikimedia.org/T178905) (owner: 10Eevans) [07:34:15] (03PS1) 10Marostegui: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438181 [07:39:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438181 (owner: 10Marostegui) [07:41:06] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438181 (owner: 10Marostegui) [07:42:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Restore db1091 original weight (duration: 00m 50s) [07:42:21] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:17] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438168 (owner: 10Marostegui) [07:43:19] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438181 (owner: 10Marostegui) [07:46:06] (03PS2) 10Muehlenhoff: Add initial Debianisation of debmonitor-client (WIP) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 [07:50:25] (03PS1) 10Muehlenhoff: Stop using DSA/DSS host keys [puppet] - 10https://gerrit.wikimedia.org/r/438190 [07:52:31] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438191 [07:52:35] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438191 [07:55:12] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438191 (owner: 10Marostegui) [07:57:16] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438191 (owner: 10Marostegui) [07:58:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1113:3316 after alter table (duration: 00m 51s) [07:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:40] (03CR) 10Gehel: [C: 031] "Looks reasonable to me. I have not followed closely the changes in T174110, but they seem to be merged and ready. I would wait for Ottomat" [puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [08:03:49] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: pc2005 down - https://phabricator.wikimedia.org/T196339#4266663 (10jcrespo) p:05High>03Normal [08:13:59] (03PS1) 10Marostegui: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438200 (https://phabricator.wikimedia.org/T194870) [08:15:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438200 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [08:16:22] (03CR) 10Alexandros Kosiaris: [C: 031] Stop using DSA/DSS host keys [puppet] - 10https://gerrit.wikimedia.org/r/438190 (owner: 10Muehlenhoff) [08:17:25] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438200 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [08:18:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1066 for reboot (duration: 00m 50s) [08:18:42] !log Stop MySQL and reboot db1066 for intel-microcode install - T194870 [08:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:48] T194870: Failover s2 primary master - https://phabricator.wikimedia.org/T194870 [08:28:47] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438191 (owner: 10Marostegui) [08:28:49] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438200 (https://phabricator.wikimedia.org/T194870) (owner: 10Marostegui) [08:30:02] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/438205 [08:31:37] 10Operations, 10Beta-Cluster-Infrastructure: confd broken on deployment-redis hosts - https://phabricator.wikimedia.org/T196596#4266698 (10Krenair) It looks like uninstalling the default redis-tools version I had put on those hosts to test this (3:3.2.6-1 that came from http://deb.debian.org/debian) and instal... [08:32:50] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Compiler not happy:" [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [08:33:16] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438205 (owner: 10Marostegui) [08:34:52] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438205 (owner: 10Marostegui) [08:35:27] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 60536 MB (12% inode=99%) [08:35:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1066 after reboot (duration: 00m 50s) [08:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:23] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438205 (owner: 10Marostegui) [08:38:26] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: add new placeholder password [labs/private] - 10https://gerrit.wikimedia.org/r/438208 (https://phabricator.wikimedia.org/T196633) [08:39:00] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] hieradata: openstack: add new placeholder password [labs/private] - 10https://gerrit.wikimedia.org/r/438208 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [08:39:31] (03PS12) 10Arturo Borrero Gonzalez: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [08:41:58] (03CR) 10Volans: "Looks good, see one comment inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/438116 (owner: 10Legoktm) [08:42:07] (03PS5) 10Alexandros Kosiaris: Add the nodes for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437995 (https://phabricator.wikimedia.org/T186748) [08:42:09] (03PS5) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748) [08:42:11] (03PS9) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 [08:42:13] (03PS1) 10Alexandros Kosiaris: conftool: Only populate scripts if service in LVS [puppet] - 10https://gerrit.wikimedia.org/r/438209 [08:44:59] (03PS1) 10Elukey: role::kafka::monitoring: raise burrow api-version for most clusters [puppet] - 10https://gerrit.wikimedia.org/r/438211 [08:45:27] RECOVERY - Disk space on elastic1018 is OK: DISK OK [08:47:59] (03CR) 10Alexandros Kosiaris: [C: 032] "Per PCC (https://puppet-compiler.wmflabs.org/compiler02/11421/proton1001.eqiad.wmnet/change.proton1001.eqiad.wmnet.err) this fixes proton " [puppet] - 10https://gerrit.wikimedia.org/r/438209 (owner: 10Alexandros Kosiaris) [08:48:20] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11422/" [puppet] - 10https://gerrit.wikimedia.org/r/438211 (owner: 10Elukey) [08:48:39] (03PS2) 10Elukey: role::kafka::monitoring: raise burrow api-version for most clusters [puppet] - 10https://gerrit.wikimedia.org/r/438211 [08:48:45] snipered by Alex :D [08:52:06] RECOVERY - puppet last run on proton1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [08:52:07] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [08:52:07] RECOVERY - puppet last run on proton2002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [08:54:27] RECOVERY - puppet last run on proton2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:01:00] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: add even more passwords placeholders [labs/private] - 10https://gerrit.wikimedia.org/r/438214 (https://phabricator.wikimedia.org/T196633) [09:06:05] !log akosiaris@deploy1001 Started deploy [proton/deploy@97ec4bf]: (no justification provided) [09:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:32] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: more placeholders fixes [labs/private] - 10https://gerrit.wikimedia.org/r/438214 (https://phabricator.wikimedia.org/T196633) [09:07:43] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] openstack: eqiad1: more placeholders fixes [labs/private] - 10https://gerrit.wikimedia.org/r/438214 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [09:11:03] (03PS1) 10Marostegui: db-eqiad,db.codfw.php: Unify and update sanitarium comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438215 (https://phabricator.wikimedia.org/T190704) [09:12:21] (03PS2) 10Marostegui: db-eqiad,db.codfw.php: Unify and update sanitarium comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438215 (https://phabricator.wikimedia.org/T190704) [09:13:28] (03PS3) 10Marostegui: db-eqiad,db.codfw.php: Unify and update sanitarium comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438215 (https://phabricator.wikimedia.org/T190704) [09:15:16] (03CR) 10Marostegui: [C: 032] db-eqiad,db.codfw.php: 
Unify and update sanitarium comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438215 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [09:16:57] (03Merged) 10jenkins-bot: db-eqiad,db.codfw.php: Unify and update sanitarium comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438215 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [09:17:55] (03CR) 10jenkins-bot: db-eqiad,db.codfw.php: Unify and update sanitarium comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438215 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [09:18:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Unify and update sanitarium comments - T190704 (duration: 00m 50s) [09:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:27] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [09:19:30] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Unify and update sanitarium comments - T190704 (duration: 00m 50s) [09:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:28] (03PS13) 10Arturo Borrero Gonzalez: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [09:30:21] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "The compiler is finally happy:" [puppet] - 10https://gerrit.wikimedia.org/r/436337 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [09:31:37] !log merging https://gerrit.wikimedia.org/r/#/c/436337/ for the eqia1 openstack deployment (labcontrol1003/labcontrol1004) [09:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:16] PROBLEM - puppet last run on proton1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[proton/deploy] [09:42:42] Hello marostegui, can you help me with DB error on production? While importing, it's constantly giving me error saying "Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.0.205)" [09:43:29] What are you trying to do? [09:43:37] Is it a big transaction? [09:44:06] I'm importing an XML dump via Special:Import [09:44:10] importing and production don't go well together [09:44:17] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.082 second response time [09:44:19] Urbanecm: That sounds scary [09:44:44] jynus, I don't get you [09:45:01] importing a large batch always results in db locks [09:45:09] that should be done server-side imho [09:45:12] * Hauskatze out [09:45:20] Urbanecm: put a ticket with what you are trying to do, which is probably not the right way [09:45:26] PROBLEM - puppet last run on proton1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. [09:45:26] PROBLEM - puppet last run on proton2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[proton/deploy]
[09:45:28] Urbanecm: If it is a big one, it is probably a well deserved error :) [09:46:46] Urbanecm: and we will help you do it in a way that works if we can [09:47:37] PROBLEM - puppet last run on proton2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[proton/deploy] [09:48:33] Urbanecm: Normally, a good "fast" solution is to import in smaller chunks [09:49:08] ok [09:52:46] RECOVERY - puppet last run on proton2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:59:36] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1951 bytes in 0.080 second response time [10:00:37] RECOVERY - puppet last run on proton1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:00:37] RECOVERY - puppet last run on proton2002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [10:02:56] 10Operations, 10Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#4266872 (10MoritzMuehlenhoff) dbus is installed by default and while I've successfully tested restarts on a number of servers (sodium, db2093, dns4001, mw1318, ores1001)... [10:03:49] 10Operations, 10Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#4266874 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [10:10:56] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:24:08] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for uwsgi-netbox [puppet] - 10https://gerrit.wikimedia.org/r/438219 (https://phabricator.wikimedia.org/T135991) [10:24:43] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for uwsgi-netbox [puppet] - 10https://gerrit.wikimedia.org/r/438219 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:25:34] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for uwsgi-netbox [puppet] - 10https://gerrit.wikimedia.org/r/438219 (https://phabricator.wikimedia.org/T135991) [10:29:09] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable more components for labcontrol boxes [puppet] - 10https://gerrit.wikimedia.org/r/438220 (https://phabricator.wikimedia.org/T196633) [10:35:58] 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#4266902 (10ayounsi) New mr1 is configured, interfaces renamed from fe- to ge-. Next step is to schedule a hard cut to swap them.
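On the Special:Import lock timeouts discussed above (09:42-09:49): the "server-side" route being alluded to is usually MediaWiki's importDump.php maintenance script, run against a pre-split dump so that each transaction stays small. A rough sketch only, assuming shell access to a maintenance host and the mwscript wrapper; the wiki name and chunk file names are placeholders:

    # import pre-split XML chunks one at a time (placeholder wiki/file names)
    for f in chunk_*.xml; do
        mwscript importDump.php --wiki=examplewiki --no-updates "$f"
    done
    mwscript rebuildrecentchanges.php --wiki=examplewiki   # importDump.php recommends this afterwards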
[10:37:34] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable more components for labcontrol boxes [puppet] - 10https://gerrit.wikimedia.org/r/438220 (https://phabricator.wikimedia.org/T196633) [10:39:41] 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#4266904 (10ayounsi) [10:54:43] (03PS3) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) [10:55:28] (03PS1) 10Marostegui: generate_dsns_table.sh: Update sanitarium hosts [software] - 10https://gerrit.wikimedia.org/r/438223 [10:56:45] (03CR) 10Marostegui: [C: 032] generate_dsns_table.sh: Update sanitarium hosts [software] - 10https://gerrit.wikimedia.org/r/438223 (owner: 10Marostegui) [10:56:47] marostegui: /me wondering if the selection could be done with a puppetdb query instead ;) [10:57:32] volans: probably, but we don't really use that script anymore, we are using compare.py these days, less risky and faster. But I thought I would update that line :) [10:57:34] (03Merged) 10jenkins-bot: generate_dsns_table.sh: Update sanitarium hosts [software] - 10https://gerrit.wikimedia.org/r/438223 (owner: 10Marostegui) [10:57:46] ack :) [11:09:23] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: add placeholders required for keystone [labs/private] - 10https://gerrit.wikimedia.org/r/438226 (https://phabricator.wikimedia.org/T196633) [11:09:53] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] openstack: eqiad1: add placeholders required for keystone [labs/private] - 10https://gerrit.wikimedia.org/r/438226 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [11:16:22] (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable more components for labcontrol boxes (keystone) [puppet] - 10https://gerrit.wikimedia.org/r/438220 (https://phabricator.wikimedia.org/T196633) [11:21:49] (03PS4) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable more components for labcontrol boxes (keystone) [puppet] - 10https://gerrit.wikimedia.org/r/438220 (https://phabricator.wikimedia.org/T196633) [11:25:05] (03CR) 10Arturo Borrero Gonzalez: [V: 032] "Catalog compiler seems happy:" [puppet] - 10https://gerrit.wikimedia.org/r/438220 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [11:52:41] (03CR) 10Dzahn: "This is the part that kept me from merging it earlier:" [puppet] - 10https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [11:57:47] (03CR) 10Dzahn: "..but after checking this again i see the IPs are only set in hiera based on ./hosts/ not on role. (for eqiad, for codfw it is by role)." 
[puppet] - 10https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [12:09:48] (03PS1) 10Dzahn: phabricator: set service IPs for phab1002 in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/438235 (https://phabricator.wikimedia.org/T196019) [12:11:32] (03PS2) 10Dzahn: phabricator: set service IPs for phab1002 in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/438235 (https://phabricator.wikimedia.org/T196019) [12:14:15] (03PS3) 10Dzahn: phabricator: set service IPs for phab1002 in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/438235 (https://phabricator.wikimedia.org/T196019) [12:14:35] (03PS4) 10Dzahn: phabricator: set service IPs for phab1002 in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/438235 (https://phabricator.wikimedia.org/T196019) [12:14:52] !log running phabricator public_task_dump script manually to confirm that it's working as expected. [12:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:19] (03CR) 10Dzahn: [C: 032] "not affecting prod since the IPs/names with 'phab1001' in it all have separate equivalents for 'phab1002' and 'git-ssh.wikimedia.org' is c" [puppet] - 10https://gerrit.wikimedia.org/r/438235 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [12:16:06] (03PS5) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable more components for labcontrol boxes (keystone) [puppet] - 10https://gerrit.wikimedia.org/r/438220 (https://phabricator.wikimedia.org/T196633) [12:16:10] (03PS2) 10Rduran: [WIP] Add unit tests for transfer.py and CumminExecution [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/437503 [12:17:42] twentyafterfour: hi ! ^ i just added new IPs for phab1002 in Hiera. but not affecting prod because i commented out the "git-ssh.wikimedia.org" part [12:17:56] just trying to prepare an actual switch as much as possible [12:18:04] (03PS6) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable more components for labcontrol boxes (keystone) [puppet] - 10https://gerrit.wikimedia.org/r/438220 (https://phabricator.wikimedia.org/T196633) [12:18:15] it was a reaction to the comments on https://gerrit.wikimedia.org/r/#/c/437300/ [12:18:18] twentyafterfour: also, welcome back :) [12:18:39] https://commons.wikimedia.org/wiki/Commons:Village_pump#Category:Pages_where_the_unstrip_size_limit_is_exceeded [12:18:42] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/438235/" [puppet] - 10https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [12:18:43] any idea? ^ [12:20:56] (03CR) 10Dzahn: [C: 032] "this added "+ListenAddress 10.64.48.21" to the sshd config" [puppet] - 10https://gerrit.wikimedia.org/r/438235 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [12:21:07] (03PS3) 10Rduran: [WIP] Add unit tests for transfer.py and CumminExecution [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/437503 [12:21:11] (03CR) 10Dzahn: "the last change above added "+ListenAddress 10.64.48.21" to the sshd config" [puppet] - 10https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [12:23:31] (03PS4) 10Dzahn: phabricator: add role to node phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) [12:24:06] mutante: thanks! 
[12:24:23] twentyafterfour: i think _now_ i can add the actual phabricator puppet role to phab1002 [12:24:29] without touching phab1001 [12:24:32] cool [12:24:48] except we will have to manually switch git-ssh.wikimedia.org later when we actually switch [12:24:54] to remove that IP from interface [12:24:57] thanks for merging the fix for the mysql slave port, btw [12:25:08] oh, you're welcome. yep [12:25:20] i was planning to add the role now and check it [12:25:25] but not make the actual switch.. just get closer [12:25:36] since i also still have to write all the reviews :p [12:26:02] and yea, we should probably announce that anyways [12:26:14] and we felt better to wait for you to be back for that [12:26:54] (03CR) 10Dzahn: [C: 032] phabricator: add role to node phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [12:28:39] watches the puppet run on phab1002.. it is adding bacula, installs the packages and all that right now [12:29:18] other related patches: https://gerrit.wikimedia.org/r/#/q/topic:phab1002+(status:open+OR+status:merged) [12:30:34] now looking what exactly this does: https://gerrit.wikimedia.org/r/#/c/437615/1/hieradata/role/common/phabricator.yaml [12:32:15] ah, it's just for the firewall rules for # ssh between phabricator servers for clustering support [12:32:54] (03CR) 10Dzahn: [C: 032] "this is for the firewall rules for # ssh between phabricator servers for clustering support" [puppet] - 10https://gerrit.wikimedia.org/r/437615 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [12:33:02] (03PS2) 10Dzahn: hiera/phabricator: add phab1002 as phab server [puppet] - 10https://gerrit.wikimedia.org/r/437615 (https://phabricator.wikimedia.org/T196019) [12:33:34] oh, we have an issue: [12:33:36] fatal: unable to access 'http://tin.eqiad.wmnet/phabricator/deployment/.git/': Failed to connect to tin.eqiad.wmnet port 80: Connection timed out [12:33:44] it should not try to clone from tin.. [12:33:53] gotta find out where tin is (hard)coded [12:34:13] mutante: Tyler already submitted a fix for that [12:34:24] aha! ok [12:34:26] however, it requires a new version of scap package to be built [12:34:29] and uploaded [12:34:38] tin is inside scap source? [12:34:40] i see [12:34:44] not exactly [12:35:09] but the code that rewrites the git url uses the local hostname, however, it wasn't called early enough to be fully effective after the name changes [12:35:25] no hostnames hard coded in packages ;) [12:35:31] ah :) ok! [12:36:32] so if you have a way to expedite T196710 somehow, that'll get that fixed I think. [12:36:32] T196710: Update Debian Package for Scap3 to 3.8.2-1 - https://phabricator.wikimedia.org/T196710 [12:37:35] so i have these left now: https://gerrit.wikimedia.org/r/#/q/topic:phab1002+(status:open) [12:37:52] looks like quite a lot [12:37:53] dumps, mtail, mariadb grants .. and the actual switch of the active server [12:38:17] dumps: not sure if it can be added like that additionally [12:38:28] mtail: i feel like something might be missing there [12:38:39] actual switch: waiting for maintenance window [12:38:45] mariadb: needs deploy from dba [12:39:06] the dumps can run on any server which has access to the db slave [12:39:08] twentyafterfour: but all this is already done https://gerrit.wikimedia.org/r/#/q/topic:phab1002+(status:merged) [12:39:28] it won't have access to db [12:40:18] yeah so after the mariadb grants ...
[12:40:32] ok, yep [12:41:50] (03CR) 1020after4: [C: 031] "needs mariadb grants so that it can access the db slave. Then we need the cron job to run /srv/phab/tools/public_task_dump.py" [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [12:45:12] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4266982 (10Dzahn) role has been applied these things are done: https://gerrit.wikimedia.org/r/#/q/topic:phab1002+(status:merged) but these are still needed: https://gerrit.wikimedia.... [12:46:22] twentyafterfour: o/ - when the switch is done it would be awesome to write some shared SRE/Releng documentation about steps to do these kind of procedures [12:46:34] so in case of emergencies we'll be quicker in fixing [12:47:26] elukey: I've actually got a task already assigned to write some sort of disaster recovery plan for phabricator. [12:47:40] T190572 [12:47:40] T190572: Prepare a disaster recovery plan for failing over from phab1001 to phab2001 (or phab2001 to 1001) - https://phabricator.wikimedia.org/T190572 [12:48:14] twentyafterfour: <3 [12:48:17] eqiad to codfw and eqiad -> other eqiad need a bit different steps [12:48:32] mutante: ok, I'll make a note of that [12:48:35] for eqiad/codfw parts of it are all prepared in hiera [12:48:40] and applied per dc [12:48:54] for eqiad we have IPs applied via hostname [12:49:00] for codfw by role [12:49:16] this inconsistency was actually nice for a switch to phab1002 in this case [12:49:28] i could just set other IPs for phab1002 also by host [12:49:58] i will help writing docs [12:50:34] twentyafterfour: btw, i also asked about how to unblock the codfw db part [12:50:44] mutante: I added a comment on T190572 with what you just wrote [12:50:50] we can get it access without using the dbproxy i heard [12:50:54] until that exists [12:50:59] cool [12:51:01] great :) [12:51:31] ok, all sounds good. i will now have to switch to review writing. we can continue here [12:51:45] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#4266993 (10jcrespo) [12:51:54] ok good luck with reviews. I've still got a couple to finish as well. [12:52:12] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#2990470 (10jcrespo) [12:52:27] you too :) ttyl and then a nice weekend after that. cya [12:52:33] mutante: one last thing: can we maybe record that list of patches in the disaster recovery doc, an example of what's needed to switch it over? [12:52:34] elukey: same for you :) laters [12:52:53] I'll see if I can capture them from gerrit [12:52:57] twentyafterfour: yes, we can link to it by the topic: https://gerrit.wikimedia.org/r/#/q/topic:phab1002+(status:open+OR+status:merged) [12:53:03] topic phab1002 [12:53:42] the "bast-test" stuff in there is because phab1002 used to be bast-test in the past.. that is not normally going to happen [12:54:43] mutante: o/ [12:55:45] twentyafterfour: i didnt mention.. also i am in Europe now .. 
and EU timezones [12:55:51] for a couple weeks [12:58:49] mutante: ok cool, I didn't know ;) [12:59:57] elukey: ok I've started a page on wikitech: https://wikitech.wikimedia.org/wiki/Phabricator/Disaster_Recovery [13:00:38] It's pretty empty so far but I'll make some time to fill in some details, it could be a real big asset if something bad happens to the existing server. [13:03:02] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[phabricator/deployment] [13:12:59] ack! [13:14:05] checking puppet on phab1002 [13:16:54] Error: Execution of '/usr/bin/scap deploy-local --repo phabricator/deployment -D log_json:False' returned 70: 13:14:12 Fetch from: http://tin.eqiad.wmnet/phabricator/deployment/.git [13:17:43] Failed to connect to tin.eqiad.wmnet port 80: Connection timed out [13:18:01] ah it is already a spare, makes sense [13:18:02] why tin? git submodule? [13:18:26] no idea, I think it is configured somewhere [13:18:38] going to check [13:19:45] 10Operations, 10Wikimedia-Mailing-lists, 10Community-Liaisons (Jul-Sep 2018): Rename (create anew) the TC team mailing list - https://phabricator.wikimedia.org/T155683#4267051 (10Aklapper) [13:19:51] (03PS1) 10Elukey: profile::kafka::mirror::alerts: tune retries and its interval [puppet] - 10https://gerrit.wikimedia.org/r/438243 (https://phabricator.wikimedia.org/T196158) [13:26:41] elukey: volans : that was already answered further above [13:27:02] 08:34 < twentyafterfour> mutante: Tyler already submitted a fix for that [13:27:14] ack, thanks mutante [13:27:25] 08:35 < twentyafterfour> but the code that rewrites the git url uses the local hostname, however, it wasn't called early enough to be fully effective after the name changes [13:27:29] 08:35 < twentyafterfour> no hostnames hard coded in packages ;) [13:29:59] yes yes I was about to write, I was checking burrow alarms and got distracted :) I didn't check the past conversation before puppet :) [13:30:03] thanks! [13:30:46] mutante: I will check the grants now [13:30:47] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11431/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/438243 (https://phabricator.wikimedia.org/T196158) (owner: 10Elukey) [13:30:48] And get it done [13:31:26] marostegui: :) thank you [13:33:56] (03PS2) 10Marostegui: mariadb: add phab1002 to phabricator grants [puppet] - 10https://gerrit.wikimedia.org/r/437613 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [13:41:34] (03CR) 10Marostegui: [C: 032] mariadb: add phab1002 to phabricator grants [puppet] - 10https://gerrit.wikimedia.org/r/437613 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [13:42:45] mutante: it is now done, if you have a way to check it, it'd be nice to confirm :) [13:46:42] (03PS1) 10WMDE-leszek: Only enable repo-specific parts of WikibaseLexeme on wikidata wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438245 (https://phabricator.wikimedia.org/T195615) [13:48:27] marostegui: thank you. i can confirm this works: [13:48:36] [phab1002:~] $ mysql -u phuser -p -h m3-master.eqiad.wmnet [13:48:43] using the "app_pass" [13:48:49] Great! [13:49:54] -u phmanifest with $maniphest_pass works too [13:50:07] so many users and passwords, heh [13:50:27] bz_user , rt_user, fab_user, app_user, manifest_user, admin_user :) [13:50:30] haha [13:51:09] bz and rt are most likely not needed anynmore.. 
once Bugzilla and RT tickets had been imported [13:51:38] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 2 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4267092 (10mobrovac) Indeed it is the same issue. @thcipriani please ping us here once the new version of Scap is available in production. [13:53:28] 10Operations, 10Scap, 10Patch-For-Review: Update Debian Package for Scap3 to 3.8.2-1 - https://phabricator.wikimedia.org/T196710#4267096 (10mobrovac) [13:53:31] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 2 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4267095 (10mobrovac) [13:53:57] The rt and bz users etc are used by the ongoing integration scripts but I have no idea if still running. Probably not. [13:54:13] twentyafterfour: would know [13:54:25] gotcha, thanks chase [14:12:51] 10Operations, 10ops-eqiad, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685#4267110 (10elukey) Hi Rob! So as far as I can see only rdb100[56] have a not-expired warranty, even rdb100[78] are super old. I think that the plan is to keep rdb100... [14:14:40] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736#4267113 (10Marostegui) [14:17:21] !log upgrade Cassandra to 3.11.2, restbase1011 & restbase1016 - T178905 [14:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:26] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [14:19:12] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [14:19:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labstore1003 SMART failure (again) - https://phabricator.wikimedia.org/T196704#4267133 (10chasemp) a:03Cmjohnson I wouldn't be very fun to lose this server before {T193655} is sorted that's for sure :) [14:21:35] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4267139 (10chasemp) >>! In T193655#4264714, @Cmjohnson wrote: > @chasemp I do not have 2 adjacent 10G racks and do not have space in 2 10G racks in the s... [14:21:53] 10Operations, 10ops-codfw: rack/setup/install graphite2003 - https://phabricator.wikimedia.org/T196483#4267141 (10Papaul) [14:22:32] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
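Following up the grant verification at 13:48 (mysql -u phuser -p -h m3-master.eqiad.wmnet): a quick, hedged way to confirm what each of the Phabricator database accounts can actually do once connected is to ask the server for its own grants. The user list here is just the two tested in the conversation, and the output will of course differ per account:

    # from phab1002; prompts for each password in turn
    for u in phuser phmanifest; do
        mysql -u "$u" -p -h m3-master.eqiad.wmnet -e 'SELECT CURRENT_USER(); SHOW GRANTS;'
    done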
[14:30:14] (03PS4) 10Aklapper: phabricator: List new and recent assignees [puppet] - 10https://gerrit.wikimedia.org/r/435984 (https://phabricator.wikimedia.org/T195780) [14:34:02] (03CR) 10Aklapper: "PS4 is definitely my last attempt: Not querying the expensive transaction table at all and restricting to tasks that have an assignee and " [puppet] - 10https://gerrit.wikimedia.org/r/435984 (https://phabricator.wikimedia.org/T195780) (owner: 10Aklapper) [14:35:21] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4267174 (10faidon) p:05Normal>03High [14:48:59] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4267196 (10chasemp) My understanding of the current situation: * Currently labnet1003 only shows eth0 connected (should be both eth0 and eth1 if 1G), and labnet1004 is n... [14:49:02] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [14:52:22] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:57:08] 10Operations, 10Cassandra, 10User-Eevans: Add Cassandra 3.11.2 package to internal APT repository - https://phabricator.wikimedia.org/T196745#4267217 (10Eevans) p:05Triage>03Normal [14:57:22] 10Operations, 10Cassandra, 10User-Eevans: Add Cassandra 3.11.2 package to internal APT repository - https://phabricator.wikimedia.org/T196745#4267217 (10Eevans) 05Open>03stalled p:05Normal>03Low [15:00:09] !log upgrade Cassandra to 3.11.2, restbase1008 - T178905 [15:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:13] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [15:01:58] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10procurement: upgrade storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651#4267245 (10RobH) a:05faidon>03Cmjohnson [15:11:37] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736#4267289 (10RobH) [15:17:28] (03PS5) 10Andrew Bogott: horizon: fix Horizon title branding [puppet] - 10https://gerrit.wikimedia.org/r/436951 (https://phabricator.wikimedia.org/T196199) (owner: 10Chico Venancio) [15:18:07] (03CR) 10Andrew Bogott: [C: 032] horizon: fix Horizon title branding [puppet] - 10https://gerrit.wikimedia.org/r/436951 (https://phabricator.wikimedia.org/T196199) (owner: 10Chico Venancio) [15:18:28] (03PS1) 10Marostegui: realm.pp: Add idprivatewikimedia as private wiki [puppet] - 10https://gerrit.wikimedia.org/r/438264 (https://phabricator.wikimedia.org/T196748) [15:22:25] (03PS2) 10Marostegui: realm.pp: Add idprivatewikimedia as private wiki [puppet] - 10https://gerrit.wikimedia.org/r/438264 (https://phabricator.wikimedia.org/T196748) [15:23:36] (03CR) 10Marostegui: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11432/" [puppet] - 10https://gerrit.wikimedia.org/r/438264 (https://phabricator.wikimedia.org/T196748) (owner: 10Marostegui) [15:26:36] !log Restart MySQL on codfw sanitarium hosts db2094, db2095 - https://phabricator.wikimedia.org/T196748 [15:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:26] (03PS1) 10Rush: openstack: labtest hw refresh initial role [puppet] - 10https://gerrit.wikimedia.org/r/438265 
(https://phabricator.wikimedia.org/T196000) [15:31:08] (03CR) 10Rush: [C: 032] openstack: labtest hw refresh initial role [puppet] - 10https://gerrit.wikimedia.org/r/438265 (https://phabricator.wikimedia.org/T196000) (owner: 10Rush) [15:31:13] (03PS2) 10Rush: openstack: labtest hw refresh initial role [puppet] - 10https://gerrit.wikimedia.org/r/438265 (https://phabricator.wikimedia.org/T196000) [15:35:26] !log upgrade Cassandra to 3.11.2, restbase1012 - T178905 [15:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:33] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [15:36:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1019 IPMI alert - https://phabricator.wikimedia.org/T196751#4267385 (10Andrew) [15:37:19] 10Operations, 10Cloud-VPS, 10Patch-For-Review: move/setup/install labtestnet2003(WMF6469) - https://phabricator.wikimedia.org/T196000#4267396 (10chasemp) 05Open>03Resolved [15:40:08] 10Operations, 10DBA, 10Traffic, 10Patch-For-Review: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4267398 (10Marostegui) p:05Triage>03Normal [16:01:09] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507#4259081 (10Bstorm) Apparently the batter was changed on this already on T194907. However, this doesn't seem to have fixed the issue. One controller having no drives seems perfectly acceptable since it isn't ac... [16:07:12] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4267463 (10Bstorm) I will add that 1G ethernet is likely fine for these servers, being labvirts, unless I was very out of the loop on some... [16:14:20] jouncebot: next [16:14:20] In 66 hour(s) and 45 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180611T1100) [16:14:39] jynus: Good evening [16:15:01] jynus: Are you free for a quick question on schema diff between labs and analytics store? [16:15:36] sorry, joal not today, manuel is in charge today [16:16:29] or wait until tuesday [16:16:44] np jynus - I'll ask manuel :) [16:18:29] marostegui: Good evening [16:18:48] marostegui: see above, I have a question - Is now an ok time? [16:19:38] (03PS1) 10Urbanecm: Change $wgMetaNamespace and $wgMetaNamespaceTalk for idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438271 (https://phabricator.wikimedia.org/T196744) [16:19:56] joal: I am logging off right now sorry, started working at 7.30am, so it has been quite a while now, can you wait till Monday? [16:20:12] marostegui: I can indeed :) No rush for us [16:20:18] marostegui: enjoy your weekend :) [16:20:23] you too! thanks! [16:21:04] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1020 - https://phabricator.wikimedia.org/T194855#4210955 (10Bstorm) This is the same alert as on T196507 and seems about the same issue (no battery reported and one controller reports no drives). I'd ask why SSD wear would matter, but I know the ways of vendo...
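The per-host Cassandra 3.11.2 upgrades logged at 14:17, 15:00 and 15:35 normally follow a drain/stop/upgrade/restart cycle with puppet disabled for the duration. This is a rough single-instance sketch only; the restbase hosts actually run multi-instance Cassandra, so the real unit names (e.g. cassandra-a, cassandra-b) and the exact package version string may differ:

    sudo puppet agent --disable 'cassandra 3.11.2 upgrade - T178905'
    nodetool drain                          # flush memtables, stop accepting writes
    sudo systemctl stop cassandra
    sudo apt-get install cassandra=3.11.2   # version string depends on the package in the internal repo
    sudo systemctl start cassandra
    nodetool status                         # wait for the node to come back as UN before moving on
    sudo puppet agent --enable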
[17:02:59] (03PS1) 10Urbanecm: Whitelist *.jpl.nasa.gov [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438273 (https://phabricator.wikimedia.org/T196727) [17:05:49] (03CR) 10Legoktm: admin: Port matrix.py to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/438116 (owner: 10Legoktm) [17:09:27] (03PS1) 10Urbanecm: idprivatewikimedia: register in DNS [dns] - 10https://gerrit.wikimedia.org/r/438275 (https://phabricator.wikimedia.org/T196747) [17:10:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labstore1003 SMART failure (again) - https://phabricator.wikimedia.org/T196704#4267602 (10Cmjohnson) @chasemp Replaced the disk today it was disk in slot 2 on array 2. megacli did not report a disk failure but the disk was flashing between green... [17:12:29] (03PS1) 10Urbanecm: idwikimedia: add Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/438276 (https://phabricator.wikimedia.org/T196747) [17:12:39] !log gerrit: taking offline for 2.14 -> 2.15 upgrade [17:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:04] yay [17:13:10] :DDD [17:14:20] !log demon@deploy1001 Started deploy [gerrit/gerrit@7324140]: 2.15.2 [17:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:31] !log demon@deploy1001 Finished deploy [gerrit/gerrit@7324140]: 2.15.2 (duration: 00m 11s) [17:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:56] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.85 and port 29418: Connection refused [17:15:57] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:17:02] :) [17:17:45] Ok, init done, running start [17:17:56] wooo and notedb begins! [17:18:07] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [17:18:37] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4267621 (10Cmjohnson) The cabling has been fixed. Both servers are now connected to asw2-b-eqiad. They are ready for install ge-7/0/9 up up labnet1003 eth0... [17:20:17] Hmm, permissions wrong on plugins, failed to load. Easy fix.... [17:20:23] (probably checked in wrong to scap repo?) [17:20:39] no_justification did you press y for the plugins? [17:20:49] or did you just pressed enter? [17:21:16] I run with --batch [17:21:17] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.2-2-g33ec53f938 (SSHD-CORE-1.6.0) (protocol 2.0) [17:21:21] So it should just load what's there [17:21:32] ah ok [17:21:42] no_justification need's a reindex [17:21:43] [2018-06-08 17:21:15,551] [sshd-SshServer[782e0844]-nio2-thread-5] ERROR com.google.gerrit.server.account.externalids.ExternalIdReader : Ignoring invalid external ID note da39a3ee5e6b4b0d3255bfef95601890afd80709 [17:21:43] Server error: 'is:wip' operator is not supported by change index version [17:21:46] Otherwise, looks good [17:22:13] Errrr [17:22:14] Hmmm [17:22:25] https://gerrit-review.googlesource.com/c/homepage/+/183493 [17:22:31] Oh, I guess batch doesn't reindex [17:22:36] I'll do an online reindex, yes? [17:22:52] either by online or offline reindex. [17:22:53] yep [17:23:06] PROBLEM - puppet last run on db1102 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [17:24:16] may take up to 15 minutes for users to see the new polygerrit update [17:24:29] (03PS1) 10Papaul: DNS: ADD mgmt & prod DNS entries for backup2001 [dns] - 10https://gerrit.wikimedia.org/r/438277 (https://phabricator.wikimedia.org/T196477) [17:26:36] Ok, started groups. What's the name of all the indexes again? [17:26:37] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [17:26:42] * paladox looks [17:26:47] accounts, groups and changes [17:27:00] https://gerrit-review.googlesource.com/Documentation/cmd-index-start.html [17:27:04] Ok, accounts too [17:27:10] yep [17:28:16] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477#4267632 (10Papaul) [17:28:32] !log powering flerovium down to move to a different space in the rack [17:28:34] Ok changes started too [17:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:53] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477#4258119 (10Papaul) [17:29:15] :) [17:29:55] (03PS2) 10Nehajha: Read rcfile if it exists and parse arguments from it using configparser [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) [17:31:07] PROBLEM - Host flerovium is DOWN: PING CRITICAL - Packet loss = 100% [17:31:37] https://phabricator.wikimedia.org/rGRBDbca87379855d0cd68bb2abd232e568a379d1a664 looks to be a notedb commit! [17:31:38] PROBLEM - Host flerovium.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:33:45] !log gerrit: up mostly, but will see some errors about "wip" label for a bit until reindexing completes. [17:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:56] Otherwise, looks like we're good? [17:34:19] yep! [17:34:27] no_justification notedb seems to have kicked in! [17:34:33] phab is picking up the commits now [17:34:41] so all commits will have a log now [17:34:47] RECOVERY - Device not healthy -SMART- on labstore1003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1003&var-datasource=eqiad%2520prometheus%252Fops [17:34:47] commits as in comments [17:35:23] (03PS1) 10Urbanecm: idprivatewikimedia: Initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438279 [17:35:30] Reindexer slowly churning thru puppet repo now [17:35:37] heh [17:36:17] PROBLEM - MegaRAID on labstore1003 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) [17:36:20] ACKNOWLEDGEMENT - MegaRAID on labstore1003 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T196757 [17:36:25] 10Operations, 10ops-eqiad: Degraded RAID on labstore1003 - https://phabricator.wikimedia.org/T196757#4267659 (10ops-monitoring-bot) [17:36:38] RECOVERY - Host flerovium is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [17:41:58] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 353.23 seconds [17:42:18] RECOVERY - Host flerovium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [17:42:28] PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 372.53 seconds [17:44:02] (03PS1) 10Bstorm: wiki replicas: place grant before create in maintain-views script [puppet] - 10https://gerrit.wikimedia.org/r/438280 (https://phabricator.wikimedia.org/T193187) [17:44:03] I guess phabricator db issues are due to reindexing? [17:45:26] yep [17:45:30] jynus: ah, probably, yes [17:45:37] phabricator is importing alot of refs/meta/* [17:45:46] (03CR) 10Jcrespo: "I am not sure the grant will work without the creation? But I am confused with so the different users, so I don't really understand what i" [puppet] - 10https://gerrit.wikimedia.org/r/438280 (https://phabricator.wikimedia.org/T193187) (owner: 10Bstorm) [17:46:38] woah, new URLs [17:46:39] I am going to ack those [17:46:58] legoktm yep, improves performance [17:47:00] with notedb [17:47:09] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4267699 (10Bstorm) Since videoconvert cleanup, we are at ``` /dev/drbd4 8.0T 6.7T 960G 88% /srv/tools ``` Waiting on a couple cleanup tickets, a... [17:47:11] paladox: I assume no wikibugs changes are necessary for this version? [17:47:15] yep [17:47:28] (03PS2) 10RobH: DNS: ADD mgmt & prod DNS entries for backup2001 [dns] - 10https://gerrit.wikimedia.org/r/438277 (https://phabricator.wikimedia.org/T196477) (owner: 10Papaul) [17:47:55] (03CR) 10RobH: [C: 032] DNS: ADD mgmt & prod DNS entries for backup2001 [dns] - 10https://gerrit.wikimedia.org/r/438277 (https://phabricator.wikimedia.org/T196477) (owner: 10Papaul) [17:47:59] 10Operations, 10Citoid, 10Code-Stewardship-Reviews, 10VisualEditor, 10Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4267700 (10Jrbranaa) @faidon it looks like there's a plan in place to address the pending EOL, in short Marielle and M... [17:48:02] jynus: After last time when it completely fell over, we limited it to one thread for reindexing (so only one repo at a time) [17:48:12] But it just finished puppet and is working on mw/core now, which are large repos. 
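For context, the reindexing being discussed above amounts to the three `gerrit index start` commands from the cmd-index-start documentation linked earlier. The invocation below is only a sketch, assuming the standard Gerrit SSH endpoint and an account with administrative capability:
```
# Kick off online reindexing of each secondary index.
ssh -p 29418 gerrit.wikimedia.org gerrit index start changes
ssh -p 29418 gerrit.wikimedia.org gerrit index start accounts
ssh -p 29418 gerrit.wikimedia.org gerrit index start groups

# The reindex tasks then appear in the background task queue.
ssh -p 29418 gerrit.wikimedia.org gerrit show-queue --wide
```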
[17:48:21] no_justification: actually, the main replica is not lagging [17:48:24] so that is good [17:48:41] but I guess over WAN it cannot handle it because the extra latency [17:48:51] though the good thing once notedb is done, there should be very low traffic to the db servers. [17:48:58] (yes, phabricator is multi-dc) [17:49:14] I am talking phabricator here, not gerrit [17:49:23] * greg-g nods [17:49:27] thanks jynus [17:49:34] gerrit db seems ok [17:50:11] (03PS2) 10Urbanecm: idprivatewikimedia: Initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438279 [17:50:22] there is no issue with that [17:50:43] just you should be aware when lag happens a failover is not possible [17:50:47] (03PS3) 10Urbanecm: id_privatewikimedia: Initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438279 [17:51:05] so we are with degraded redundancy, which is ok on maintenance, as long as you are aware of it [17:51:28] ack [17:51:52] (03PS2) 10Urbanecm: id_privatewikimedia: add Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/438276 (https://phabricator.wikimedia.org/T196747) [17:51:58] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:52:20] (03PS2) 10Urbanecm: id_privatewikimedia: register in DNS [dns] - 10https://gerrit.wikimedia.org/r/438275 (https://phabricator.wikimedia.org/T196747) [17:52:22] (03CR) 10Marostegui: "> I am not sure the grant will work without the creation? But I am" [puppet] - 10https://gerrit.wikimedia.org/r/438280 (https://phabricator.wikimedia.org/T193187) (owner: 10Bstorm) [17:53:00] (03CR) 10Jcrespo: "> > I am not sure the grant will work without the creation? But I am" [puppet] - 10https://gerrit.wikimedia.org/r/438280 (https://phabricator.wikimedia.org/T193187) (owner: 10Bstorm) [17:53:17] RECOVERY - puppet last run on db1102 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:54:30] (03CR) 10Bstorm: "The lasbdbuser role affects the user that this script runs as. I think that's where this ends up needing this." [puppet] - 10https://gerrit.wikimedia.org/r/438280 (https://phabricator.wikimedia.org/T193187) (owner: 10Bstorm) [17:54:55] (03CR) 10Bstorm: "Pretending I spelled that correctly anyway :)" [puppet] - 10https://gerrit.wikimedia.org/r/438280 (https://phabricator.wikimedia.org/T193187) (owner: 10Bstorm) [17:55:14] (03CR) 10Bstorm: [C: 032] wiki replicas: place grant before create in maintain-views script [puppet] - 10https://gerrit.wikimedia.org/r/438280 (https://phabricator.wikimedia.org/T193187) (owner: 10Bstorm) [18:02:33] !log upgrade Cassandra to 3.11.2, restbase1017 & restbase1013 - T178905 [18:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:38] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:03:21] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4267727 (10chasemp) I totally forgot to tell @bstorm we have announced the invasive parts of this in the past, old example: https://lists.wikimedia.org/pi... 
[18:05:24] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507#4267738 (10RobH) p:05Triage>03Normal [18:14:44] 10Operations, 10Analytics, 10hardware-requests: eqiad: (1) new stat box to offload users from stat1005 - https://phabricator.wikimedia.org/T196345#4267759 (10RobH) a:03elukey So, the difference between this request, and our current dual cpu misc spec, is we no longer put in 4 * 4TB disks (like stat1006), b... [18:14:47] 10Operations, 10Analytics, 10hardware-requests: eqiad: (1) new stat box to offload users from stat1005 - https://phabricator.wikimedia.org/T196345#4267762 (10RobH) [18:16:52] 10Operations, 10Analytics, 10hardware-requests: eqiad: (1) new stat box to offload users from stat1005 - https://phabricator.wikimedia.org/T196345#4267765 (10RobH) Please note I'll be away all next week, so if this needs quotation before I return, please chat with @Cmjohnson & @faidon. [18:17:20] 10Operations, 10ops-eqiad, 10Cloud-VPS: upgrade storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651#4267778 (10RobH) [18:19:40] (03PS1) 10Urbanecm: Set wgLocaltimezone to Europe/Rome for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438282 (https://phabricator.wikimedia.org/T196763) [18:24:49] (03PS1) 10Urbanecm: Set a few of namespace aliases on ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438283 (https://phabricator.wikimedia.org/T196719) [18:32:07] 10Operations, 10ops-eqiad: Decommission wmf4195 and wmf4196 - https://phabricator.wikimedia.org/T196766#4267844 (10Cmjohnson) [18:32:31] 10Operations, 10ops-eqiad: Decommission wmf4195 and wmf4196 - https://phabricator.wikimedia.org/T196766#4267856 (10Cmjohnson) [18:32:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4267857 (10Cmjohnson) [18:33:13] (03PS1) 10Cmjohnson: Removing mgmt dns wmf4195 wmf4196 [dns] - 10https://gerrit.wikimedia.org/r/438284 (https://phabricator.wikimedia.org/T196766) [18:34:16] (03PS2) 10Cmjohnson: Removing mgmt dns wmf4195 wmf4196 [dns] - 10https://gerrit.wikimedia.org/r/438284 (https://phabricator.wikimedia.org/T196766) [18:34:40] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns wmf4195 wmf4196 [dns] - 10https://gerrit.wikimedia.org/r/438284 (https://phabricator.wikimedia.org/T196766) (owner: 10Cmjohnson) [18:38:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4267871 (10Cmjohnson) [18:38:58] 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission wmf4195 and wmf4196 - https://phabricator.wikimedia.org/T196766#4267869 (10Cmjohnson) 05Open>03Resolved These servers have been wiped and removed from the rack. racktables udpated [18:41:18] (03CR) 10Volans: admin: Port matrix.py to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/438116 (owner: 10Legoktm) [18:45:25] volans: I wonder how much that matters given https://www.python.org/dev/peps/pep-0538/ [18:46:22] (03PS2) 10Legoktm: admin: Port matrix.py to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/438116 [18:46:35] legoktm: locale.getpreferredencoding(False) gives me 'US-ASCII locally :) [18:46:41] maybe is just me though ;) [18:46:45] (03CR) 10Legoktm: admin: Port matrix.py to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/438116 (owner: 10Legoktm) [18:47:07] volans: what OS/Python version? 
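The encoding question volans and legoktm are debating above is easy to reproduce from a shell. A small sketch (Python 3 assumed; exact locale names vary per system):
```
# What Python reports as the preferred encoding under the current locale:
python3 -c 'import locale; print(locale.getpreferredencoding(False))'

# Under a plain C/POSIX locale this is typically ASCII (ANSI_X3.4-1968 on glibc):
LC_ALL=C python3 -c 'import locale; print(locale.getpreferredencoding(False))'

# With a UTF-8 locale -- or PEP 538 locale coercion on Python 3.7+ -- it is UTF-8:
LC_ALL=C.UTF-8 python3 -c 'import locale; print(locale.getpreferredencoding(False))'
```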
[18:48:53] gerrit shows errors to me [18:48:54] 'is:wip' operator is not supported by change index version [18:49:05] greg-g: ^ not sure who's around? [18:49:10] answer may be just wait a while. scripts running [18:49:16] !log upgrade Cassandra to 3.11.2, restbase1009 & restbase1014 - T178905 [18:49:17] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [18:49:18] scripts for what? [18:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:21] chad is running the online reindexer [18:49:21] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:49:25] legoktm: it's independent of the python version, I get the same with all installed version [18:49:26] what happened? [18:49:27] which will take alot of time [18:49:34] new search indexes post gerrit upgrade paravoid [18:49:34] a gerrit upgrade paravoid [18:49:37] uh? [18:49:42] was that scheduled? [18:50:08] * Reedy shrugs [18:52:37] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:53:58] uh, all gerrit url changed [18:54:08] at least they redirect from the old ones [18:54:36] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/438116 (owner: 10Legoktm) [18:54:37] I get "The page you requested was not found, or you do not have permission to view this page." [18:54:47] (03PS4) 10Urbanecm: id_privatewikimedia: Initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438279 [18:54:59] hm not for everything [18:55:45] for which url? [18:56:42] https://gerrit.wikimedia.org/r/#/c/433818/ for instance [18:57:40] dunno what it is, it was just on my awesomebar [18:57:56] hmm [18:58:08] 10Operations, 10fundraising-tech-ops, 10netops: adjust NAT mapping for frdata.wikimedia.org - https://phabricator.wikimedia.org/T196656#4267897 (10Jgreen) [18:58:23] I'm surprised by this upgrade (and its downtime) [18:59:11] paravoid oh what surprises you? [18:59:20] was it scheduled? [18:59:34] that would be for chad and greg-g to answer :) [18:59:35] I didn't hear it anywhere, nor saw it announced anywhere, but maybe that's just me [18:59:55] and also it's Friday evening :) [19:00:14] It's Friday Lunch in SF :P [19:00:43] sure ok [19:01:07] It's not on the deployment calendar, though, indeed [19:01:08] paravoid im guessing your problem could be either notedb (it's still running) or it was a draft [19:02:01] paravoid: I don't mean to distract from your point, but there are other deployments happening right now (restbase). Regarding gerrit, it was in-progress this week but was blocked before completion yesterday due to archiva issues. There was minimal downtime and the reindexing should (almost certain) address the issue you're facing (100% certain the wip error message). Sorry it fell over to a [19:02:03] Friday but we needed this upgrade done now before chad leaves. [19:02:33] fwiw, I'm using gerrit with no issues at all. Not sure if I'm doing anything differently [19:03:41] Its change dependent [19:05:26] There's a long online migration. Actual downtime was minutes [19:06:54] (03PS1) 10Cmjohnson: Adding mgmt dns for labstore1008/9 [dns] - 10https://gerrit.wikimedia.org/r/439287 (https://phabricator.wikimedia.org/T193655) [19:18:29] no_justification has a notedb.config file been created in /var/lib/gerrit2/review_site/etc/? 
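One way to answer that question from the Gerrit host itself, sketched with the site path that appears later in this log; the exact keys the migrator writes vary by Gerrit version, so simply listing the file is the safest check:
```
# Does the NoteDB migration marker config exist yet, and what does it contain?
ls -l /var/lib/gerrit2/review_site/etc/notedb.config
git config -f /var/lib/gerrit2/review_site/etc/notedb.config --list
```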
[19:18:40] (if it has that's been it's fully completed) [19:21:48] PROBLEM - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:21:57] PROBLEM - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused [19:22:29] ^^^ got that [19:22:34] paladox: not sure [19:22:51] ok [19:22:53] apparently i did not use a long enough scheduled maintenance :/ [19:22:56] no_justification notedb should have done it :) [19:23:21] it does it to shut off database access for changes once everything is in the repos [19:23:32] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused eevans scheduled maintenance expired no problem here [19:23:32] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans scheduled maintenance expired no problem here [19:24:07] RECOVERY - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is OK: SSL OK - Certificate restbase1014-b valid until 2018-08-17 16:11:21 +0000 (expires in 69 days) [19:24:08] RECOVERY - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is OK: TCP OK - 0.000 second response time on 10.64.48.136 port 9042 [19:34:56] (03PS3) 10Paladox: Gerrit: Set notedb configs to enable notedb [puppet] - 10https://gerrit.wikimedia.org/r/408298 (https://phabricator.wikimedia.org/T174034) [19:35:21] !bash I'm surprised by this upgrade (and its downtime) [19:35:22] framawiki: Stored quip at https://tools.wmflabs.org/bash/quip/AWPg5Wx6wY2u4JUTY7iH [19:35:59] !log upgrade Cassandra to 3.11.2, restbase1015 & restbase1018 - T178905 [19:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:04] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [19:44:28] PROBLEM - MariaDB Slave Lag: m3 on db1117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.57 seconds [19:52:22] ACKNOWLEDGEMENT - MariaDB Slave Lag: m3 on db1117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 436.62 seconds Jcrespo ongoing phabricator maintenance [20:14:10] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324#4268143 (10mmodell) [20:16:54] 10Operations, 10Cloud-VPS, 10cloud-services-team: templatetiger is using 827G of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183954#4268148 (10Kolossos) Hello, it's cleaned up. BTW: A solution for T184126 would be really nice. [20:27:56] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4268198 (10Bstorm) Following templatetiger cleanup: ``` /dev/drbd4 8.0T 5.6T 2.1T 74% /srv/tools ``` [20:30:38] RECOVERY - MariaDB Slave Lag: m3 on db1117 is OK: OK slave_sql_lag Replication lag: 57.10 seconds [20:40:32] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4268220 (10Cmjohnson) [20:41:04] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4175441 (10Cmjohnson) The raid has been setup as well. 
Raid 10 256k Stripe on the server and disk shelf. [20:45:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom/reclaim tin - https://phabricator.wikimedia.org/T196175#4268239 (10RobH) [20:47:22] 10Operations, 10hardware-requests: Replacement hardware for cumin masters - https://phabricator.wikimedia.org/T178392#4268240 (10RobH) 05stalled>03Resolved a:03RobH [20:48:54] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477#4268242 (10Papaul) [20:51:35] 10Operations, 10ops-codfw, 10netops: Switch port configuration for backup2001 - https://phabricator.wikimedia.org/T196782#4268246 (10Papaul) p:05Triage>03Normal [22:09:33] 10Operations, 10Cloud-VPS, 10DNS, 10Beta-Cluster-reproducible: Create some mechanism for instances in projects to modify the project Designate records - https://phabricator.wikimedia.org/T184245#4268365 (10Krenair) [22:09:36] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Create custom deployment-prep role that allows editing of Designate records only - https://phabricator.wikimedia.org/T194998#4268363 (10Krenair) 05Open>03Resolved a:03Andrew [22:10:09] 10Operations, 10Cloud-VPS, 10DNS, 10Beta-Cluster-reproducible: Create some mechanism for instances in projects to modify the project Designate records - https://phabricator.wikimedia.org/T184245#3877216 (10Krenair) 05Open>03Resolved a:03Andrew see child ticket [22:26:35] no_justification has it completed reindexing? :) [22:27:37] I'm still getting is:wip errors so I assume not [22:32:21] yeh [22:32:25] just making sure :) [22:34:08] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:18] paladox: You could also check from show-queue -w [22:34:19] ;-) [22:34:24] ohhh [22:34:28] * paladox does that :) [22:34:32] 357 tasks to go [22:34:46] 325 [22:35:04] 296 [22:35:07] 287 [22:35:10] 280 [22:35:12] etc [22:35:24] thanks [22:35:35] didn't realise we could run show-queue as mere mortals [22:35:59] i guess you need to be in gerrit-managers [22:36:02] or be a admin [22:36:36] I'm not any of those things but I can show-queue [22:36:54] it doesn't show the high number of tasks that no_justification just posted [22:37:07] either meaning they all got done before I ran it [22:37:09] or they're not visible to me [22:37:12] 92 tasks now [22:37:27] $ ssh gerrit gerrit show-queue -w [22:37:27] Task State StartTime Command [22:37:27] ------------------------------------------------------------------------------ [22:37:27] ------------------------------------------------------------------------------ [22:37:27] 0 tasks [22:37:33] oh [22:37:39] i doin't think you can see the index [22:37:47] 39 [22:38:03] That's ... 
not useful [22:38:14] You'd think it would provide an error [22:38:16] Not pretend there's 0 [22:38:22] it does sometimes show entries [22:38:30] like earlier I got [22:38:36] fdf0c759 22:37:16.560 mediawiki/extensions/WikimediaMaintenance [22:38:36] 7d405756 22:37:16.561 mediawiki/extensions/WikidataPageBanner [22:38:36] bd41ef5d 22:37:16.563 mediawiki/extensions/WikimediaEvents [22:38:36] 1d4fdb89 22:37:16.564 mediawiki/extensions/WikimediaMessages [22:38:37] fd4be77b 22:37:16.565 mediawiki/extensions/WikimediaIncubator [22:38:49] but that was the most I saw after running in a few times in the last couple of minutes [22:38:50] hmm it shows no tasks now (except 4) but not index [22:38:59] and still get 'is:wip' operator is not supported by change index version [22:39:10] I think maybe it's just returning a limited subset of tasks in the queue to us [22:39:21] We should add com.google.gerrit.server.auth.NoSuchUserException to suppressed errors. [22:39:27] That's f'ing useless to log [22:39:54] Hmmm. That's.... [22:39:58] Why is still erroring? [22:40:32] hmm [22:40:51] Restart service? [22:41:10] Maybe its knowledge of what operators it can handle is cached in memory? [22:41:26] We loaded available operators, which didn't include wip because index wasn't migrated [22:41:30] And now they're stuck? [22:41:50] !log gerrit: restarting [22:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:21] hmm, i guess so [22:44:00] Nope, no bueno [22:44:17] i get "503 Service Unavailable" [22:44:24] as a popup [22:44:38] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater] [22:44:40] though polygerrit show Server error: 'is:wip' operator is not supported by change index version [22:44:44] so i guess it was cached [22:44:55] no_justification did you use --force when doing the reindex? [22:45:06] Oh. No.... [22:45:27] we should try --force. [22:45:40] it only forces the online reindexer to start [22:45:48] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [22:45:54] but this is strange how it keeps saying "Server error: 'is:wip' operator is not supported by change index version" [22:47:21] [2018-06-08 22:47:07,590] [SSH gerrit index start --force changes (demon)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user demon account 2) during gerrit index start --force changes [22:47:21] java.lang.NullPointerException [22:47:28] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [22:47:29] hmm [22:47:36] that's new [22:47:39] I have a full stacktrace but that doesn't help now [22:47:40] :\ [22:48:06] hmmmm https://www.irccloud.com/pastebin/zxlXqfYd/gerrit_stacktrace [22:48:08] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [22:49:10] https://github.com/GerritCodeReview/gerrit/blob/stable-2.15/gerrit-server/src/main/java/com/google/gerrit/server/index/OnlineReindexer.java#L93 [22:49:39] ) during git-receive-pack '/apps/android/wikipedia.git' [22:49:39] java.lang.NullPointerException [22:49:39] at com.google.gerrit.server.index.change.ChangeIndexRewriter.isIndexPredicate(ChangeIndexRewriter.java:245) [22:49:45] Something's up with the index version #s [22:49:58] oh [22:50:04] no_justification does offline work? [22:50:17] Probably. [22:50:23] I've always trusted it more [22:50:32] ok [22:50:48] * paladox hopes that will fix our problem, it should be quicker at least i think? [22:51:06] https://www.irccloud.com/pastebin/H6Gcg9Sf/ [22:51:12] Getting those everywhere too..... [22:51:38] it should be reading from the db until it's done them all [22:51:41] so safe for now [22:51:47] (it's writing to both backends) [22:52:17] https://groups.google.com/forum/#!topic/repo-discuss/OKXf_v-W4go [22:54:07] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:54:37] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [22:54:47] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.85 and port 29418: Connection refused [22:54:48] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [22:55:34] Ack all of those ^ [22:55:58] PROBLEM - puppet last run on db1102 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [22:56:56] How many projects we got again? [22:57:08] PROBLEM - puppet last run on db2094 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [22:57:11] um i think more then 999 [22:59:20] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [23:00:03] so we're doing an offline reindex now? [23:00:27] Collecting projects: 2046 [23:00:28] Yeah [23:00:31] It'll go faster [23:01:54] offline is always faster than online [23:02:00] I've never trusted online [23:02:04] But I prefer uptime [23:02:10] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui] [23:02:36] paladox: You're right about --force [23:02:43] I think that's why online /never/ works [23:02:47] Because old changes stay stale [23:02:48] heh :) [23:02:50] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [23:02:50] yeh [23:04:09] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [23:04:10] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 5 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] [23:04:10] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer] [23:04:29] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 6 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [23:04:30] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [23:04:31] paladox: Worth filing an issue over online indexing w/o force? [23:04:35] Like, it should force [23:04:38] yeh [23:04:40] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [23:04:42] * paladox files one now [23:05:08] I'm only using 4 threads cuz I don't wanna hammer m2 [23:05:12] Or risk it, at least [23:05:54] Also: online reindexing of changes seems to go by project. Offline goes just by changes themselves [23:05:56] It's kind of weird [23:06:10] Maybe they're identical, the reporting is not [23:06:11] time to take myself out of paging for vacation \o/ [23:06:40] * robh knows better than to just do it and ensures there is a clean puppet run on icinga with no other pending icinga changes [23:06:49] PROBLEM - puppet last run on db2095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [23:07:11] robh: Any chance you can ack those failures? [23:07:12] <3 [23:07:14] Pretty please? [23:07:19] A dying man's last request? [23:07:19] uh, why are they failing? [23:07:20] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [23:07:25] gerrit is offline [23:07:26] Had to take gerrit down [23:07:26] that seems bad [23:07:27] oh [23:07:29] For offline reindex [23:07:31] those resources try to pull from gerrit [23:07:31] They're fine [23:07:38] no_justification https://bugs.chromium.org/p/gerrit/issues/detail?id=9219 [23:07:40] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [23:07:42] (hence, why I wanted to start pulling from a slave or mirror) [23:07:50] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [23:08:08] so puppets going to fail on all of those until gerrit is back? 
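The offline run mentioned above corresponds roughly to Gerrit's stand-alone reindex program pointed at the review site with the thread count discussed. A sketch, assuming the gerrit.war and site paths seen elsewhere in this log and the service stopped first:
```
# Offline reindex of the changes index only, limited to 4 worker threads.
java -jar /var/lib/gerrit2/review_site/bin/gerrit.war reindex \
     -d /var/lib/gerrit2/review_site \
     --index changes \
     --threads 4
```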
[23:08:10] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer] [23:08:23] Anything pulling a repo, basically git::clone{} stanzas [23:08:38] oh, do we have a list of all those servers (are they a specific service group?) [23:08:40] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [23:08:44] No :( [23:08:48] Any role could have it way down [23:08:50] i mean, i can ack the ones i see now but it seems they're going to keep failing for the next 15m or so right? [23:09:14] how long will it be down? [23:09:22] It's gonna be a bit :( [23:09:29] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [23:09:35] a bit = 1 hour, 5+ hours, 1+ day? [23:09:40] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics.wikimedia.org] [23:09:40] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [23:10:08] 2-3 h? [23:10:24] Long enough I don't wanna start waking folks who go to bed in other timezones [23:10:24] dammmnnn [23:10:25] so, uh, we have no way to know, and i think its bad to just disable puppet monitoring fleetwide [23:10:29] these dont page [23:10:33] pupept failures never ever page [23:10:34] Oh ok [23:10:36] =] [23:10:36] does it make sense to publish an announcement that gerrit is going to be down for several hours? [23:10:46] SMalyshev: yes [23:10:50] no_justification: ? [23:10:51] puppet failure is bad but not a critical wake up folks event. if its going to stay that way all weekend though [23:10:54] it may become critical [23:10:58] parting announce email? [23:11:13] greg-g: "Today's my last day. Oh and Gerrit's down for a few hours. kthnxbai" [23:11:34] oh, damn, i guess im not taking myself out of paging righ tnow ;] [23:11:40] ill ask someone to do it monday, hehe [23:11:42] paladox: Wait, task you filed....you can run online indexer w/ force? [23:11:46] Then why am I doing offline?! [23:11:55] no_justification because it failed for you? [23:11:58] using the online indexer [23:12:14] Well, it ran the first time. Failed later [23:12:17] [2018-06-08 22:47:07,590] [SSH gerrit index start --force changes (demon)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user demon account 2) during gerrit index start --force changes [23:12:19] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [23:12:27] I ran activate in between [23:12:33] That's when the NPE's started [23:12:35] oh [23:12:42] Idk why that command exists anyway [23:12:50] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 6 minutes ago with 7 failures. 
Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] [23:14:46] oh, ill just change it to email only, problem solved (in private repo hence no gerrit [23:14:50] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [23:15:12] So, I can ack every single one as they come up in icinga, but as they dont cause any pages, not sure thre is a point in doing so in icinga for these [23:15:28] since just disabling puppet check fleet wide seems bad, and otherwise we dont know what has the failure until they fail. [23:16:04] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [23:16:20] and my change didnt break icinga, reloaded my new config successfully. [23:24:43] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [23:25:32] Reindexing changes: projects: 0% (18/2046), 42% (181629/432396) (-) [23:25:37] Just shows you the longggggg tail [23:25:47] wow [23:25:52] only on the 18 project [23:25:54] It's going quickly, cuz we're almost halfway [23:26:03] But we started with the big ones it looks like [23:26:05] At least one [23:26:12] yep [23:26:18] 4 threads remember, so only 4 projects at once [23:26:35] :) [23:28:09] Reindexing changes: projects: 1% (28/2046), 48% (207552/432396) (\) [23:28:35] heh [23:30:11] robh: Ack'ing the ones re: cobalt itself likely a good idea. [23:30:19] daemon, ssh etc down [23:31:00] done [23:31:04] ACKNOWLEDGEMENT - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
rhalsell RI team working on gerrit [23:31:05] ACKNOWLEDGEMENT - SSH access on cobalt is CRITICAL: connect to address 208.80.154.85 and port 29418: Connection refused rhalsell RI team working on gerrit [23:31:05] ACKNOWLEDGEMENT - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site rhalsell RI team working on gerrit [23:32:01] 59% of changes done [23:32:28] :) [23:34:48] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Create custom deployment-prep role that allows editing of Designate records only - https://phabricator.wikimedia.org/T194998#4268414 (10Peachey88) [23:44:01] Reindexing changes: projects: 51% (1050/2046), 97% (421699/432396) (/) [23:44:10] so half way :) [23:44:17] No, 97% [23:44:35] You're reindexing /changes/ not projects [23:44:48] So the projects # is mostly just interesting for showing you the distribution of changes/project [23:44:57] Reindexing changes: projects: 99% (2045/2046), 99% (430550/432396) (\) [23:44:58] See ^ [23:45:13] oh [23:45:22] i see now [23:45:30] Less than 1k changes to go [23:45:41] :) [23:46:24] Reindexed 432214 documents in changes index in 2925.8s (147.7/s) [23:46:28] That's not too bad imho [23:46:32] :) [23:46:54] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [23:48:00] Address already in use for sshd? [23:48:03] ....wut? [23:48:38] heh [23:49:59] I don't see anything listening on 29418 [23:50:10] hmm [23:50:16] Cannot bind to gerrit.wikimedia.org:29418, gerrit.wikimedia.org:29418 [23:50:18] Why twice? [23:50:23] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:50:24] uh [23:50:40] Oh, we had two listenAddress entries [23:50:44] I blame init [23:50:46] oh [23:50:47] lol [23:51:24] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [23:51:50] woo [23:51:53] it works no_justification! [23:51:56] dashboard [23:51:57] [2018-06-08 23:51:43,671] [OnlineNoteDbMigrator] ERROR com.google.gerrit.server.notedb.rebuild.NoteDbMigrator : Error migrating primary storage for 3850 [23:51:57] com.google.gerrit.server.notedb.PrimaryStorageMigrator$NoNoteDbStateException: change 3850 has no note_db_state; rebuild it first [23:51:57] Still lots of those [23:52:03] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.2-2-g33ec53f938 (SSHD-CORE-1.6.0) (protocol 2.0) [23:52:04] hmm [23:52:11] Yay, but we're back [23:52:13] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [23:52:14] 3850 does not exist [23:52:21] Ah, there's a bunch of them that DNE [23:52:21] so i suppose it is that it was deleted [23:52:26] I guess changes that were deleted? 
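Two quick checks for the bind error hit during the restart above, sketched for the Gerrit host (config path as used in this log): confirm nothing is actually holding the SSH port, and list every configured listen address, since a duplicated `sshd.listenAddress` entry is what produced the double bind attempt.
```
# Is anything actually holding the Gerrit SSH port?
ss -ltnp | grep 29418 || echo "nothing listening on 29418"

# List every configured SSH listen address; duplicate entries show up twice.
git config -f /var/lib/gerrit2/review_site/etc/gerrit.config --get-all sshd.listenAddress
```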
[23:52:29] yeh [23:52:37] 18683 [23:52:39] Etc [23:52:52] there's a thread [23:52:53] * paladox looks for it [23:53:45] no_justification https://groups.google.com/forum/#!topic/repo-discuss/OKXf_v-W4go [23:54:44] no_justification https://gerrit.wikimedia.org/r/dashboard/chadh@wikimedia.org heh [23:54:53] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:55:14] RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:55:39] paladox: Why would you have a url like that? [23:55:55] no_justification gerrit allows you to view what users see in there dashboard [23:55:59] new feature in 2.15 [23:56:14] like https://gerrit.wikimedia.org/r/dashboard/thomasmulhall410@yahoo.com [23:56:37] (and gerrit links to it) [23:57:01] from https://gerrit.wikimedia.org/r/q/owner:%2522Paladox+%253Cthomasmulhall410%2540yahoo.com%253E%2522 [23:57:54] RECOVERY - puppet last run on db2094 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [23:58:20] Ah [23:58:33] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [23:58:51] you can use usernames too [23:59:43] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [23:59:53] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [23:59:57] yay [23:59:57] (03CR) 10Paladox: "😊" [puppet] - 10https://gerrit.wikimedia.org/r/408298 (https://phabricator.wikimedia.org/T174034) (owner: 10Paladox)
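The per-user dashboards shown above are backed by ordinary change queries, so the same data can also be pulled without the UI. A sketch (the username is just the example already used in this log):
```
# Same data via the SSH query API...
ssh -p 29418 gerrit.wikimedia.org gerrit query --format=JSON "owner:paladox status:open" | head -n 5

# ...or via the REST changes endpoint the new dashboard URLs sit on top of.
curl -s 'https://gerrit.wikimedia.org/r/changes/?q=owner:paladox&n=5'
```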