[00:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180614T0000). [00:31:34] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [00:33:54] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [00:48:54] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [00:52:14] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:16:44] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404) [02:17:45] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [02:18:44] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [02:22:04] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
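The PROBLEM / RECOVERY pairs above (proton1002 flapping for a couple of minutes, kubernetes2003 going back to CRITICAL a few minutes after each recovery) are easier to reason about once they are paired up. The following is only a rough sketch for reading a log like this one, not how the alerting bot itself works: the regular expression is an assumption based on the message shape seen here, the function names `parse_alert` and `outage_durations` are made up for illustration, and it assumes one message per input string with all timestamps on the same day.

```python
import re
from datetime import timedelta

# Assumed shape, inferred from the messages in this log, e.g.
# "[00:48:54] RECOVERY - Check systemd state on kubernetes2003 is OK: ..."
ALERT_RE = re.compile(
    r"^\[(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})\] "
    r"(?P<kind>PROBLEM|RECOVERY) - (?P<check>.+?) on (?P<host>\S+) "
    r"is (?P<state>OK|WARNING|CRITICAL|UNKNOWN):"
)


def parse_alert(line):
    """Return a dict describing one alert message, or None for other lines."""
    m = ALERT_RE.match(line)
    if m is None:
        return None
    at = timedelta(hours=int(m["h"]), minutes=int(m["m"]), seconds=int(m["s"]))
    return {
        "at": at,
        "kind": m["kind"],
        "check": m["check"],
        "host": m["host"],
        "state": m["state"],
    }


def outage_durations(lines):
    """Pair each PROBLEM with the next RECOVERY for the same (check, host).

    A RECOVERY with no earlier PROBLEM in the input is ignored, and a
    PROBLEM that never recovers within the input produces no entry.
    """
    open_since = {}
    durations = []
    for line in lines:
        alert = parse_alert(line)
        if alert is None:
            continue
        key = (alert["check"], alert["host"])
        if alert["kind"] == "PROBLEM":
            # Keep the first PROBLEM time if the check is already open.
            open_since.setdefault(key, alert["at"])
        elif key in open_since:
            durations.append((key, alert["at"] - open_since.pop(key)))
    return durations
```

Fed the first proton1002 PROBLEM and RECOVERY messages from this log, `outage_durations` reports a single outage for `("proton endpoints health", "proton1002")`. The kubernetes2003 entries at 00:48:54 and 00:52:14 produce nothing, because that RECOVERY has no preceding PROBLEM in the excerpt and the later PROBLEM is still open.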
[02:22:05] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.7) (duration: 09m 41s) [02:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:43] (03PS15) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [02:24:37] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [02:55:49] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.8) (duration: 14m 42s) [02:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:19] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Jun 14 03:06:19 UTC 2018 (duration 10m 30s) [03:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:14] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [03:22:34] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[03:58:39] (03PS16) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [03:59:30] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [04:01:29] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200) [04:02:38] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy [04:09:48] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (e [04:12:58] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy [04:13:08] (03PS17) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [04:14:01] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [04:41:40] (03PS18) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [04:41:44] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [04:49:04] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [04:50:11] (03PS19) 
10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [04:50:56] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [04:52:24] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:56:41] (03PS20) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [04:57:26] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [05:03:11] !log Deploy schema change on s4 primary master (db1068) T191316 T192926 T195193 [05:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:18] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [05:03:19] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [05:03:19] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [05:05:12] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440265 [05:05:16] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440265 [05:05:56] 10Operations, 10ops-codfw: Disk predictive failure on db2052 - https://phabricator.wikimedia.org/T197146#4281738 (10Marostegui) 05Open>03Resolved a:03Papaul All good! Thank you! ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physi... 
[05:07:16] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440265 (owner: 10Marostegui) [05:08:50] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440265 (owner: 10Marostegui) [05:09:00] (03PS21) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [05:09:03] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440265 (owner: 10Marostegui) [05:09:49] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [05:10:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1105:3311 after alter table (duration: 01m 01s) [05:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:24] (03PS1) 10Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440266 (https://phabricator.wikimedia.org/T191316) [05:12:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440266 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:13:40] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440266 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:13:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440266 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:14:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1083 for alter table (duration: 00m 58s) [05:15:02] !log Deploy schema change on db1083 T191316 T192926 T89737 T195193 [05:15:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:09] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [05:15:09] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [05:15:09] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [05:15:10] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [05:20:46] (03PS22) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [05:21:30] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [05:46:22] (03PS23) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [05:47:10] !log LDAP - added user mepps to wmf group (T192472) [05:47:13] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [05:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:15] T192472: JupyterHub access for meps not working (was: Requesting access to analytics servers for mepps) - https://phabricator.wikimedia.org/T192472 [05:48:05] 10Operations, 10Analytics, 10Jupyter-Hub, 10SRE-Access-Requests: JupyterHub access for meps not working (was: Requesting access to analytics servers for mepps) - https://phabricator.wikimedia.org/T192472#4281768 (10Dzahn) @mepps Try again now. I added you to "wmf". Seems you were actually _not_ in that yet... 
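The !log entries above follow a repeating pattern for the schema change: one replica is depooled, altered, and repooled before the next one is touched (db1105:3311 is repooled at 05:10:12, and only then is db1083 depooled at 05:14:59). As a rough illustration of that ordering only, here is a sketch that generates the step list for a set of hosts. The function name `rolling_alter_steps` is hypothetical, the one-host-at-a-time rule is inferred from the log rather than stated in it, and the step strings simply mirror the wording of the log messages; nothing here talks to a real database or deployment tool.

```python
def rolling_alter_steps(hosts):
    """Return log-style step descriptions, handling one host at a time.

    Each host is depooled, altered and repooled before the next host
    starts, so at most one host from the list is out of rotation at once.
    """
    steps = []
    for host in hosts:
        steps.append(f"Depool {host} for alter table")
        steps.append(f"Deploy schema change on {host}")
        steps.append(f"Repool {host} after alter table")
    return steps
```

For example, `rolling_alter_steps(["db1083"])` gives the three steps for db1083 in the same order in which they are logged here.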
[05:52:13] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4281781 (10Marostegui) This is great! Thanks for getting on with this - these are my first thoughts! > Depool/pool/warmup >dbconfig [depo... [05:54:23] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4281782 (10Marostegui) p:05Triage>03Normal [05:57:48] 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4281783 (10Marostegui) 05Open>03Resolved a:03mmodell This is now back to normal values: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&var... [05:59:57] (03CR) 10EBernhardson: "puppet compiler against various prod elasticsearch servers (master and non-master capable) in our clusters, against PS23: https://puppet-c" [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [06:02:36] (03CR) 10EBernhardson: "looks like data paths are being incorrectly defaulted for some hosts as well. 
not seeing /srv/elasticsearch make it into any of them," [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [06:06:17] (03PS1) 10Dzahn: cache::misc: switch backend for dbtree from terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440267 (https://phabricator.wikimedia.org/T192092) [06:07:29] (03PS2) 10Dzahn: cache::misc: switch backend for dbtree from terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440267 (https://phabricator.wikimedia.org/T192092) [06:09:43] (03PS1) 10Dzahn: tendril: add grants for tendril_web from mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440268 (https://phabricator.wikimedia.org/T192092) [06:11:15] 10Operations, 10Cassandra, 10Maps-Sprint, 10Scap: cassandra/metrics-collector does not deploy with scap on a new install - https://phabricator.wikimedia.org/T197159#4281799 (10Gehel) Looking on `deploy1001`, I see that `/srv/deployment/cassandra/metrics-collector/.git/DEPLOY_HEAD` also has a reference to `... [06:11:33] 10Operations, 10Cassandra, 10Maps-Sprint, 10Release-Engineering-Team, 10Scap: cassandra/metrics-collector does not deploy with scap on a new install - https://phabricator.wikimedia.org/T197159#4281800 (10Gehel) [06:11:39] marostegui: could i have another mariadb grant please. 
it's for migrating dbtree.wm.org (still used, right) from terbium to mwmaint1001 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440268/ [06:11:58] checking [06:12:00] (yea, dbtree uses the tendril_web user and hosted on mw-maint) [06:12:05] (03PS2) 10Elukey: cassandra: all 2.2 clusters should use the cassandra22 APT component [puppet] - 10https://gerrit.wikimedia.org/r/440164 (owner: 10Gehel) [06:14:26] mutante: feel free to merge, I will create the grants on the db [06:14:44] marostegui: yay:) thanks [06:14:49] (03CR) 10Gehel: [C: 04-1] cassandra: all 2.2 clusters should use the cassandra22 APT component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440164 (owner: 10Gehel) [06:14:59] (03CR) 10Dzahn: [C: 032] tendril: add grants for tendril_web from mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440268 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [06:15:30] Done from my side, check whenever you have time :) [06:16:44] marostegui: confirmed working now :) [06:16:50] \o/ [06:17:07] i suppose i can move dbtree.wm anytime now [06:17:17] just have to switch in cache:misc [06:24:35] 10Operations, 10Wikimedia-Planet: Only include the last e.g. 6 months of news - https://phabricator.wikimedia.org/T196965#4281815 (10Samwilson) Oh yes, I didn't look close enough at the source before; it is indeed RSS (as much as anything ever is)! :-) It's that old thing of the description tag not being very... [06:27:07] (03PS3) 10Elukey: cassandra: all 2.2 clusters should use the cassandra22 APT component [puppet] - 10https://gerrit.wikimedia.org/r/440164 (owner: 10Gehel) [06:27:28] 10Operations, 10Wikimedia-Planet: Only include the last e.g. 6 months of news - https://phabricator.wikimedia.org/T196965#4281821 (10Dzahn) >>! In T196965#4281815, @Samwilson wrote: > I've often wondered about a tool to import RSS articles into MediaWiki (along with images etc.)... There is the Mediawiki RSS... [06:27:31] elukey: damn... 
you're doing all my work :) [06:27:35] thanks! [06:28:13] gehel: I was only trying to update the patch since I added those nits, that's it :) [06:28:51] elukey: he, he, he... no problem! If you want to have a look at the rest of my todo list, feel free to take on any task you'd like :) [06:29:03] :) [06:35:31] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.086 second response time [06:38:17] ^ there are 2 patches .. one to up the timeout from 5 to 10m..and one to outright stop wikidata dispatching [06:38:40] (the second is just gor migrating it to new server) [06:39:54] (03PS3) 10Dzahn: Switch from 5 mins to 10 mins for wikidata dispatch check [puppet] - 10https://gerrit.wikimedia.org/r/439528 (https://phabricator.wikimedia.org/T194602) (owner: 10Addshore) [06:40:19] doing the first one.. but only temporary .. per comment on gerrit [06:40:44] will meet wikidata people in person today and discuss it [06:41:53] (03CR) 10Dzahn: [C: 032] "temporary measure" [puppet] - 10https://gerrit.wikimedia.org/r/439528 (https://phabricator.wikimedia.org/T194602) (owner: 10Addshore) [06:42:35] (03PS2) 10Dzahn: mediawiki: Stop Wikidata dispatching [puppet] - 10https://gerrit.wikimedia.org/r/440142 (https://phabricator.wikimedia.org/T192092) (owner: 10Ladsgroup) [06:44:57] (03PS2) 10Elukey: profile::hadoop::common: expand journal nodes from 3 to 5 [puppet] - 10https://gerrit.wikimedia.org/r/440130 (https://phabricator.wikimedia.org/T189105) [06:52:39] (03PS3) 10Vgutierrez: vcl: use synthetic warning for 1% of AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) [06:53:14] mutante: terbium dies today? 
:) [06:53:53] thanks for that puppet merge, I think I have got 3,000,000 emails in the past few weeks [07:02:16] addshore: yea, now the plan is that i go to the WMDE office and meet Ladsgroup and then we switch that part [07:02:24] (in Berlin) [07:02:40] oooooh, coool :D [07:02:50] "die" is a little much to call it.. but yea.. give it the "spare"role [07:03:32] addshore: i am unsure if disabling it should happen right before it or some hours before https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440142/2/modules/mediawiki/manifests/maintenance/wikidata.pp [07:04:03] quote "Hey, Please merge and deploy this before switching to mwmaint1001 (it will probably cause some alarms to scream) and then revert it once the new node online, otherwise locks got stale and has to expire causing even more disruptions." [07:04:24] there is a log file we can tail :) [07:04:40] it should probably also be absented on testwiki too tbh [07:05:12] ok, cool [07:05:40] mutante: the max run of each script is 720 seconds https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/Wikibase.php#L113 [07:06:08] so, run puppet to disable, wait 12 mins and you should be fine [07:06:20] great. thanks for confirming [07:07:06] /var/log/wikidata/dispatchChanges-wikidatawiki.log is normally a pretty active log file [07:07:53] although to be honest, if you turn the cron off, you can just leave the process to finish and start them on the new host no problem, just don't kill them on terbium as then we will have some locks floating around :) [07:09:43] ok, gotcha. no need to turn it off right now then.. 
if we switch in a couple hours [07:09:56] nope [07:28:49] (03PS1) 10Muehlenhoff: Record new MOU for nithum [puppet] - 10https://gerrit.wikimedia.org/r/440271 [07:29:33] (03CR) 10Muehlenhoff: [C: 032] Record new MOU for nithum [puppet] - 10https://gerrit.wikimedia.org/r/440271 (owner: 10Muehlenhoff) [07:32:54] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440272 [07:34:27] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440272 (owner: 10Marostegui) [07:35:53] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440272 (owner: 10Marostegui) [07:37:03] (03PS1) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440273 (https://phabricator.wikimedia.org/T191316) [07:37:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1083 after alter table (duration: 00m 58s) [07:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:05] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440272 (owner: 10Marostegui) [07:38:36] (03PS1) 10Volans: Use sha256 as checksum for the client self-update [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440275 (https://phabricator.wikimedia.org/T191300) [07:38:40] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440273 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [07:39:48] (03CR) 10jerkins-bot: [V: 04-1] Use sha256 as checksum for the client self-update [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440275 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [07:40:18] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/440273 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [07:41:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1114 for alter table (duration: 00m 58s) [07:41:38] !log Deploy schema change on db1114 T191316 T192926 T89737 T195193 [07:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:47] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [07:41:47] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [07:41:48] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [07:41:48] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [07:42:35] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440273 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [07:49:16] (03CR) 10Muehlenhoff: [C: 031] Use sha256 as checksum for the client self-update [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440275 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [07:53:19] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4281980 (10ovasileva) p:05Normal>03High [07:56:34] PROBLEM - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1961 bytes in 0.077 second response time [07:58:33] (03CR) 10Elukey: [C: 032] profile::hadoop::common: expand journal nodes from 3 to 5 [puppet] - 10https://gerrit.wikimedia.org/r/440130 (https://phabricator.wikimedia.org/T189105) (owner: 10Elukey) [07:58:35] 
(03PS1) 10Addshore: BETA: wikidata, lexeme, enable senses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440279 (https://phabricator.wikimedia.org/T195652) [07:58:38] (03PS3) 10Elukey: profile::hadoop::common: expand journal nodes from 3 to 5 [puppet] - 10https://gerrit.wikimedia.org/r/440130 (https://phabricator.wikimedia.org/T189105) [07:58:40] (03CR) 10Elukey: [V: 032 C: 032] profile::hadoop::common: expand journal nodes from 3 to 5 [puppet] - 10https://gerrit.wikimedia.org/r/440130 (https://phabricator.wikimedia.org/T189105) (owner: 10Elukey) [07:58:59] !log resuming rolling restart of cassandra on restbase2* to pick up OpenJDK security update [07:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:39] (03CR) 10Addshore: [C: 032] BETA: wikidata, lexeme, enable senses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440279 (https://phabricator.wikimedia.org/T195652) (owner: 10Addshore) [08:01:24] (03Merged) 10jenkins-bot: BETA: wikidata, lexeme, enable senses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440279 (https://phabricator.wikimedia.org/T195652) (owner: 10Addshore) [08:02:15] (03CR) 10Vgutierrez: [C: 031] "bonus points available, LGTM otherwise :)" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440275 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [08:03:42] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: BETAONLY: [[gerrit:440279|senses for beta wikidatas]] (duration: 00m 58s) [08:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:55] PROBLEM - DPKG on analytics1040 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:05:55] RECOVERY - DPKG on analytics1040 is OK: All packages OK [08:06:40] (03PS3) 10Dzahn: cache::misc: switch backend for dbtree from terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440267 (https://phabricator.wikimedia.org/T192092) [08:07:48] !log 
roll restart of hadoop journal nodes to pick up the new configuration (two more journal nodes added) [08:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:55] (03CR) 10Gehel: "puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler02/11507/" [puppet] - 10https://gerrit.wikimedia.org/r/440164 (owner: 10Gehel) [08:08:00] (03PS4) 10Gehel: cassandra: all 2.2 clusters should use the cassandra22 APT component [puppet] - 10https://gerrit.wikimedia.org/r/440164 [08:08:18] (03CR) 10Dzahn: [C: 032] "requested and received new mariadb grants for it, confirmed they work, tested Apache part with apache-fast-test from deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/440267 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [08:08:54] !log switch backend for dbtree.wikimedia.org away from terbium to mwmaint1001 (T192092) [08:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:59] T192092: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092 [08:10:33] (03CR) 10Gehel: [C: 032] cassandra: all 2.2 clusters should use the cassandra22 APT component [puppet] - 10https://gerrit.wikimedia.org/r/440164 (owner: 10Gehel) [08:10:56] (03PS5) 10Gehel: cassandra: all 2.2 clusters should use the cassandra22 APT component [puppet] - 10https://gerrit.wikimedia.org/r/440164 [08:15:39] (03CR) 10Gehel: [C: 032] "Looks good, checksums validated" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/440249 (owner: 10DCausse) [08:15:54] (03CR) 10Gehel: [V: 032 C: 032] Bump extra version to 5.5.2.7 and highlighter version to 5.5.2.3 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/440249 (owner: 10DCausse) [08:16:35] RECOVERY - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1948 bytes in 0.065 second response time [08:19:32] (03PS1) 10Legoktm: Bump 
ExtensionDistributor default to REL1_31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440283 [08:22:38] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282045 (10jcrespo) > I think that if NAME is just the hostname, it should show the config for all the configured HOST:PORT combinations i... [08:22:39] (03PS2) 10Ayounsi: Add Icinga alert for varnish http traffic drop [puppet] - 10https://gerrit.wikimedia.org/r/440069 [08:22:53] (03CR) 10jerkins-bot: [V: 04-1] Add Icinga alert for varnish http traffic drop [puppet] - 10https://gerrit.wikimedia.org/r/440069 (owner: 10Ayounsi) [08:24:37] (03PS1) 10Dzahn: Revert "cache::misc: switch backend for dbtree from terbium to mwmaint1001" [puppet] - 10https://gerrit.wikimedia.org/r/440284 [08:25:28] (03CR) 10Dzahn: [C: 032] "dbtree uses mysql_connect(). That doesn't exist anymore with PHP7 on stretch. So "PHP Fatal error: Uncaught Error: Call to undefined func" [puppet] - 10https://gerrit.wikimedia.org/r/440284 (owner: 10Dzahn) [08:25:49] (03PS2) 10Dzahn: Revert "cache::misc: switch backend for dbtree from terbium to mwmaint1001" [puppet] - 10https://gerrit.wikimedia.org/r/440284 [08:26:15] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282069 (10Joe) >>! In T197126#4280521, @Volans wrote: > Quick first feedback/questions on the proposal: > >> dbconfig get NAME gets you... 
[08:26:19] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282070 (10Joe) [08:26:22] (03PS2) 10Marostegui: mariadb: Set db1095 as spare, remove unused code [puppet] - 10https://gerrit.wikimedia.org/r/437720 (https://phabricator.wikimedia.org/T196376) [08:27:48] (03CR) 10Ayounsi: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11508/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/440069 (owner: 10Ayounsi) [08:28:06] (03PS3) 10Ayounsi: Add Icinga alert for varnish http traffic drop [puppet] - 10https://gerrit.wikimedia.org/r/440069 [08:28:40] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282078 (10Volans) @Joe ack to all your replies, thanks for integrating the suggestions! [08:28:43] (03CR) 10jerkins-bot: [V: 04-1] Add Icinga alert for varnish http traffic drop [puppet] - 10https://gerrit.wikimedia.org/r/440069 (owner: 10Ayounsi) [08:29:03] (03CR) 10Ayounsi: [V: 032 C: 032] Add Icinga alert for varnish http traffic drop [puppet] - 10https://gerrit.wikimedia.org/r/440069 (owner: 10Ayounsi) [08:29:14] (03PS3) 10Marostegui: mariadb: Set db1095 as spare, remove unused code [puppet] - 10https://gerrit.wikimedia.org/r/437720 (https://phabricator.wikimedia.org/T196376) [08:29:43] (03CR) 10jenkins-bot: BETA: wikidata, lexeme, enable senses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440279 (https://phabricator.wikimedia.org/T195652) (owner: 10Addshore) [08:29:58] !log restart of elasticsearch / relforge for plugin updates [08:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:14] !log restart hadoop hdfs master nodes to pick up the new journal node settings [08:31:14] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the 
state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282085 (10Joe) >>! In T197126#4282045, @jcrespo wrote: > After thinking for a while, `pool|depool|warmup` (as an interface, not as the id... [08:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:19] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282100 (10Joe) [08:33:52] PROBLEM - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.067 second response time [08:35:52] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282109 (10Joe) >>! In T197126#4281781, @Marostegui wrote: > This is great! Thanks for getting on with this - these are my first thoughts!... [08:36:57] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282110 (10Joe) [08:42:33] 10Operations, 10DBA, 10Traffic, 10WMF-Legal, and 3 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#4282120 (10Bawolff) [08:44:38] (03CR) 10Volans: "reply inline" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440275 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [08:44:41] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[cassandra] [08:46:10] (03CR) 10Giuseppe Lavagetto: [C: 032] Add the capability to check for deprecated defines [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433162 (https://phabricator.wikimedia.org/T194724) (owner: 10Giuseppe Lavagetto) [08:46:32] (03CR) 10Giuseppe Lavagetto: [C: 032] Check for all the available variants of a hiera call [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433163 (owner: 10Giuseppe Lavagetto) [08:50:36] (03PS2) 10ArielGlenn: generate temp stubs for page ranges serially from same input stub file [dumps] - 10https://gerrit.wikimedia.org/r/436956 (https://phabricator.wikimedia.org/T196063) [08:51:05] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282151 (10Marostegui) >>! In T197126#4282109, @Joe wrote: >> >> This is not likely to happen in a near future, but as we are starting fr... 
[08:54:11] (03PS1) 10MarcoAurelio: beta: declare beta sr.wikipedia and beta crh.wikipedia to langlist-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440289 [08:58:02] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@2680ba8]: Decrease the checkerJob delays recheck to 10 minutes [08:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:02] RECOVERY - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1967 bytes in 0.098 second response time [08:59:20] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@2680ba8]: Decrease the checkerJob delays recheck to 10 minutes (duration: 01m 18s) [08:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:41] PROBLEM - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.134 and port 9042: Connection refused [09:00:01] PROBLEM - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:01:25] I'll take a look at cassandra [09:02:02] RECOVERY - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-a valid until 2018-08-17 16:11:49 +0000 (expires in 64 days) [09:03:01] RECOVERY - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.134 port 9042 [09:03:14] (03PS1) 10Giuseppe Lavagetto: Fix gemspec warnings [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/440293 [09:03:31] (03CR) 10jerkins-bot: [V: 04-1] Fix gemspec warnings [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/440293 (owner: 10Giuseppe Lavagetto) [09:03:52] expired downtime, fixing [09:04:12] kk, thanks [09:04:31] PROBLEM - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.135 and port 9042: Connection refused [09:04:32] PROBLEM - 
cassandra-b SSL 10.192.32.135:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:04:44] 10Operations, 10Wikimedia-Logstash, 10Services (watching): Logstash started showing full serialized log entry as a message - https://phabricator.wikimedia.org/T197219#4282186 (10Pchelolo) [09:06:41] (03PS1) 10Giuseppe Lavagetto: Use the newer version of the wmf_styleguide check [puppet] - 10https://gerrit.wikimedia.org/r/440294 [09:07:38] (03CR) 10jerkins-bot: [V: 04-1] Use the newer version of the wmf_styleguide check [puppet] - 10https://gerrit.wikimedia.org/r/440294 (owner: 10Giuseppe Lavagetto) [09:09:02] RECOVERY - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-b valid until 2018-08-17 16:11:50 +0000 (expires in 64 days) [09:09:52] RECOVERY - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.135 port 9042 [09:14:17] <_joe_> I hate you jenkins [09:15:02] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:15:36] !log add debmonitor term to analytics-in4 on cr1/cr2 eqiad [09:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:42] Cc: XioNoX --^ [09:16:07] elukey: task#? 
[09:16:57] T191300 [09:16:57] T191300: Debmonitor: deploy the agent across the fleet - https://phabricator.wikimedia.org/T191300 [09:17:03] (I am doing cr2 atm) [09:19:53] (done) [09:26:49] thx [09:27:19] elukey: I don't see the term you're pushing in the task :) [09:27:54] XioNoX: I was completing another hadoop maintenance, gimme a min and I'll add it [09:27:57] :) [09:28:04] no rush, thanks [09:29:12] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Debmonitor: deploy the agent across the fleet - https://phabricator.wikimedia.org/T191300#4100643 (10elukey) Added the following term to the analytics-in4 filter on cr1/2-eqiad to allow debmonitor to operate in the analytics vlan: ``` eluk... [09:29:15] XioNoX: --^ [09:29:35] seems to work fine from Moritz's tests [09:30:03] (03PS2) 10Giuseppe Lavagetto: Use the newer version of the wmf_styleguide check [puppet] - 10https://gerrit.wikimedia.org/r/440294 [09:30:05] (03PS1) 10Giuseppe Lavagetto: Fix remaining rubocop violations [puppet] - 10https://gerrit.wikimedia.org/r/440296 [09:30:08] ack, confirmed working from analytics1040 [09:30:13] ok, will update automation repo [09:31:41] PROBLEM - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1973 bytes in 0.086 second response time [09:31:59] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Debmonitor: deploy the agent across the fleet - https://phabricator.wikimedia.org/T191300#4282285 (10Volans) Thanks a lot @elukey! [09:36:10] elukey: adding a v6 term as well? [09:37:41] XioNoX: shouldn't we have analytics-in6 first ? [09:38:00] eh, yeah [09:38:14] I was about to ask that :) [09:38:23] maybe we can work on it next week? [09:38:34] like a quick draft and then commit to do the work next quarter [09:38:40] should be relatively easy no? 
[09:39:07] (brb) [09:39:12] (03PS1) 10Elukey: role::aqs: set cassandra version to 2.2.6-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/440308 (https://phabricator.wikimedia.org/T197062) [09:52:35] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440310 [09:53:33] (03PS2) 10Giuseppe Lavagetto: Fix remaining rubocop violations [puppet] - 10https://gerrit.wikimedia.org/r/440296 [09:56:20] (03PS4) 10Ema: cache: allow installing separate VCL files [puppet] - 10https://gerrit.wikimedia.org/r/440123 (https://phabricator.wikimedia.org/T164609) [09:57:04] (03CR) 10Ema: [C: 032] cache: allow installing separate VCL files [puppet] - 10https://gerrit.wikimedia.org/r/440123 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [09:57:11] RECOVERY - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1969 bytes in 0.072 second response time [09:57:55] !log for i in {1..1000}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki // T197222 [09:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:00] T197222: Lemma value and language removed from Lexeme page header - https://phabricator.wikimedia.org/T197222 [09:58:58] (03PS1) 10Volans: Host detail: fix package sorting order [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440312 (https://phabricator.wikimedia.org/T167504) [09:59:12] !log @terbium: for i in {1001..2000}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki // T197222 [09:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:12] (03PS3) 10Giuseppe Lavagetto: Fix remaining rubocop violations [puppet] - 10https://gerrit.wikimedia.org/r/440296 [10:00:26] (03CR) 10jerkins-bot: [V: 04-1] Host detail: fix package sorting order [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440312 (https://phabricator.wikimedia.org/T167504) (owner: 
10Volans) [10:04:00] !log @terbium: for i in {2001..3000}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki // T197222 [10:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:04] T197222: Lemma value and language removed from Lexeme page header - https://phabricator.wikimedia.org/T197222 [10:04:31] PROBLEM - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.086 second response time [10:08:08] <_joe_> addshore: I think you should start using mwmaint1001, just FYI [10:08:16] oh yes! [10:08:24] <_joe_> :) [10:08:27] _joe_: very true, I did read that mail :D [10:08:32] * addshore goes to update his bash alias [10:09:03] <_joe_> addshore: mostly as in a few hours (I guess?) you won't have access to terbium anymore [10:09:25] yup, just updated my bash alias, I'll be using mwmaint from now on :) [10:09:57] <_joe_> what a nice name schema we picked, heh? [10:09:58] <_joe_> :P [10:10:34] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix remaining rubocop violations [puppet] - 10https://gerrit.wikimedia.org/r/440296 (owner: 10Giuseppe Lavagetto) [10:14:41] (03CR) 10Muehlenhoff: [C: 031] Host detail: fix package sorting order [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440312 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [10:15:08] Amir1: i am outside WMDE office, let me in :) [10:15:35] mutante: oooh, have you seen the 3rd floor yet? 
[10:15:50] no, never been inside this building [10:15:54] oooooh [10:16:05] Tempelhofer Ufer 23 [10:16:12] you should be able to buzz the wikimedia button downstairs and someone will let you in [10:16:26] ok) [10:16:28] then you can go to the wmde door [10:16:46] 1st floor there is a reception and you can talk to them, then they will take you to the 3rd floor i guess [10:20:19] (03PS3) 10Giuseppe Lavagetto: Use the newer version of the wmf_styleguide check [puppet] - 10https://gerrit.wikimedia.org/r/440294 [10:20:22] addshore: i made it to 3rd floor but there is no door sign? [10:20:36] It's fine, it's definitely right [10:20:50] is Amir1 definitely in the office? :P if you ring the bell someone should come [10:21:14] somebody opened but doesn't know Wikimedia [10:21:22] no door bell :) [10:21:36] which 3rd floor are you on? :P [10:21:45] (03PS2) 10Urbanecm: Set a few of namespace aliases on ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438283 (https://phabricator.wikimedia.org/T196719) [10:22:06] (03CR) 10Giuseppe Lavagetto: [C: 032] Use the newer version of the wmf_styleguide check [puppet] - 10https://gerrit.wikimedia.org/r/440294 (owner: 10Giuseppe Lavagetto) [10:22:35] is there a physical button that says "Wikimedia" on it? 
[10:22:44] legoktm: not on the third floor, but there is a bell :P [10:23:08] the physical button is at the entrance and 1st floor only :) [10:24:00] (03PS2) 10Giuseppe Lavagetto: role::deployment_server: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436760 [10:24:14] (03Abandoned) 10Urbanecm: Update static logo resources for bewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438088 (https://phabricator.wikimedia.org/T196599) (owner: 10Urbanecm) [10:24:16] (03Abandoned) 10Urbanecm: Use HD logos in bewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438089 (https://phabricator.wikimedia.org/T196599) (owner: 10Urbanecm) [10:24:37] (03CR) 10jerkins-bot: [V: 04-1] role::deployment_server: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436760 (owner: 10Giuseppe Lavagetto) [10:26:45] (03PS2) 10Urbanecm: Allow bcts on private&fishbowl wikis advanced privilege manipulation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) [10:27:02] (03PS2) 10Urbanecm: Clean legacy AddGroups/RemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440001 (https://phabricator.wikimedia.org/T197024) [10:27:11] (03PS2) 10Urbanecm: Some wikis bureacurats are able to grant non-grantable groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440002 (https://phabricator.wikimedia.org/T197026) [10:27:31] (03PS2) 10Urbanecm: Clear duplicate right specification [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) [10:30:50] mutante: I assume you are in? :) [10:37:06] 10Operations, 10Cassandra, 10Maps-Sprint, 10Release-Engineering-Team, 10Scap: cassandra/metrics-collector does not deploy with scap on a new install - https://phabricator.wikimedia.org/T197159#4280346 (10Joe) Yes, giving a simple `scap deploy --init` in the directory on deployment1001 was enough to fix t... 
[10:37:21] 10Operations, 10Cassandra, 10Maps-Sprint, 10Release-Engineering-Team, 10Scap: cassandra/metrics-collector does not deploy with scap on a new install - https://phabricator.wikimedia.org/T197159#4282505 (10Joe) 05Open>03Resolved a:03Joe [10:45:46] addshore: yes, i made it into 3rd floor [10:45:50] woo! [10:45:52] and on the guest wifi right now [10:46:03] waiting for wikidata team to end their meeting [10:46:11] hehe, I'm in that meeting right now ;) [10:47:27] addshore: quick, end the meeting [10:50:56] hahaa [10:56:08] Meeting? :O [10:58:20] Bsadowski1: Yep, talking about the future of the Wikidata frontend :P [10:59:53] adds to the agenda 'how to lower query dispatcher lag' [11:00:05] * hoo has some thoughts on that [11:00:08] :) [11:00:11] mutante: is mwmaint1001 newer hardware than terbium? :P [11:00:15] addshore: yes [11:00:18] * addshore also has some thoughts on that [11:00:29] mutante: will be interesting to see if it increases the performance of dispatching at all :P [11:00:43] i just talked to daniel about it [11:00:50] It will… turning the hhvm JIT on did wonders [11:00:51] and he said we can run it on both servers at the same time [11:00:58] mutante: Yeah [11:00:59] but, at the end of the day, the whole thing needs to be rewritten anyway [11:00:59] since it uses redis for locking and not lock files [11:01:04] mutante: indeed [11:01:10] so we can ..just activate it on mwmaint1001 and not touch terbium [11:01:18] then how to confirm it works from mwmaint1001 [11:01:23] mutante: i would be slightly worried about running 2x the number of dispatchers [11:02:44] well, i mean, they will just end up hitting more locks / lockouts, but meh, probably doesn't actually matter [11:02:49] what would be the worst case that you worry about [11:03:36] well, i guess it will just be increased db queries, not necessarily for any more actual dispatching [11:04:12] i see, *nod* [11:05:24] RECOVERY - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is 
OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.079 second response time [11:07:57] (03Abandoned) 10Dzahn: Revert "deployment::server: add rsync for home dirs" [puppet] - 10https://gerrit.wikimedia.org/r/436995 (owner: 10Dzahn) [11:12:33] PROBLEM - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1968 bytes in 0.081 second response time [11:13:29] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440310 (owner: 10Marostegui) [11:14:47] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440310 (owner: 10Marostegui) [11:15:19] (03PS1) 10Volans: Frontend: specify items shown per page [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440326 (https://phabricator.wikimedia.org/T167504) [11:16:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1114 after alter table (duration: 01m 01s) [11:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:43] !log upgrade cassandra on aqs* to 2.2.6-wmf5 [11:16:43] (03CR) 10jerkins-bot: [V: 04-1] Frontend: specify items shown per page [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440326 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:51] (03PS2) 10Elukey: role::aqs: set cassandra version to 2.2.6-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/440308 (https://phabricator.wikimedia.org/T197062) [11:16:58] (03PS2) 10Volans: Frontend: specify items shown per page [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440326 (https://phabricator.wikimedia.org/T167504) [11:17:42] (03CR) 10Elukey: [C: 032] role::aqs: set cassandra version to 2.2.6-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/440308 
(https://phabricator.wikimedia.org/T197062) (owner: 10Elukey) [11:18:10] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440310 (owner: 10Marostegui) [11:18:15] (03CR) 10jerkins-bot: [V: 04-1] Frontend: specify items shown per page [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440326 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:19:36] (03PS1) 10Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440327 (https://phabricator.wikimedia.org/T191316) [11:20:41] (03PS1) 10Dzahn: mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [11:21:36] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [11:22:17] (03PS2) 10Dzahn: mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [11:22:23] PROBLEM - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is CRITICAL: connect to address 10.64.0.126 and port 9042: Connection refused [11:22:25] (03PS3) 10Dzahn: mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [11:22:53] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440327 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [11:23:20] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [11:23:24] RECOVERY - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is OK: TCP OK - 
0.000 second response time on 10.64.0.126 port 9042 [11:23:34] (03CR) 10Dzahn: [C: 04-1] mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [11:23:39] aqs noise is me doing restarts [11:23:41] (03CR) 10Muehlenhoff: [C: 031] "Thanks! We can fine-tune the other tables as the real world usage goes." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440326 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:24:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440327 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [11:25:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1067 for alter table (duration: 00m 57s) [11:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:42] (03PS4) 10Dzahn: mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [11:27:39] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [11:28:22] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440327 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [11:33:03] (03PS5) 10Dzahn: mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [11:34:01] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [11:38:21] !log Deploy schema change on db1067 T191316 
T192926 T89737 T195193 [11:38:27] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [11:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:32] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [11:38:32] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [11:38:32] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [11:38:33] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [11:38:44] (03PS6) 10Dzahn: mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [11:38:46] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [11:39:35] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [11:39:39] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [11:39:46] sigh :) [11:40:24] 10Operations, 10ops-esams: Move cp3030+ from OE14 to OE13 in racktables - https://phabricator.wikimedia.org/T136403#4282599 (10mark) p:05Lowest>03High [11:41:34] paladox: do you see what i'm doing wrong? [11:42:06] NoMethodError: undefined method `positive?' 
for 0:Fixnum [11:43:22] That sounds like a bug [11:43:27] As rake aborted [11:43:34] (03PS7) 10Dzahn: mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [11:43:47] 10Operations, 10ops-esams: Move cp3030+ from OE14 to OE13 in racktables - https://phabricator.wikimedia.org/T136403#4282614 (10mark) a:03mark It looks like cp3030-cp3039 are in OE13 (despite what Racktables says), and cp3040+ are in OE40. So this is swapped from reality in Racktables. This is confirmed with... [11:44:25] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [11:45:06] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/439554 (owner: 10Dzahn) [11:45:19] paladox: running recheck on some random other change to confirm [11:45:37] (03CR) 10jerkins-bot: [V: 04-1] network::monitor: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/439554 (owner: 10Dzahn) [11:45:57] yea, CI is broken [11:46:01] that same thing was +2 before [11:46:23] Ok [11:46:34] reporting in -releng [11:46:40] Ok [11:51:49] <_joe_> maybe it's my fault mutante [11:52:11] _joe_: NoMethodError: undefined method `positive?' for 0:Fixnum [11:52:14] <_joe_> 11:45:35 NoMethodError: undefined method `positive?' for 0:Fixnum [11:52:18] <_joe_> yeah, my fault [11:52:25] <_joe_> it worked in rbenv though [11:52:32] <_joe_> mutante: I'll fix it [11:52:58] <_joe_> sorry [11:55:48] _joe_: ah:) thanks! 
[11:57:59] RECOVERY - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1969 bytes in 0.076 second response time [11:58:00] (03PS1) 10Giuseppe Lavagetto: rake: fix error in uncommitted_changes [puppet] - 10https://gerrit.wikimedia.org/r/440335 [11:58:27] (03CR) 10Giuseppe Lavagetto: [C: 032] rake: fix error in uncommitted_changes [puppet] - 10https://gerrit.wikimedia.org/r/440335 (owner: 10Giuseppe Lavagetto) [12:00:40] <_joe_> uhm [12:00:44] (03PS2) 10Giuseppe Lavagetto: network::monitor: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/439554 (owner: 10Dzahn) [12:01:27] <_joe_> ok fixed mutante [12:10:29] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 503 (expecting: 200) [12:11:19] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 360.20 seconds [12:11:48] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [12:12:49] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 376.44 seconds [12:16:58] (03PS6) 10ArielGlenn: allow writeuptopageid to write multiple output files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/436511 (https://phabricator.wikimedia.org/T196063) [12:16:58] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 82 ESP OK [12:16:59] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 68 ESP OK [12:16:59] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 68 ESP OK [12:17:08] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [12:17:08] RECOVERY - Host cp3037 is UP: PING OK - Packet loss = 0%, RTA = 83.65 ms [12:17:08] RECOVERY - IPsec on cp1062 is OK: 
Strongswan OK - 68 ESP OK [12:17:09] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 82 ESP OK [12:17:18] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 82 ESP OK [12:17:19] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 68 ESP OK [12:17:19] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 82 ESP OK [12:17:28] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 82 ESP OK [12:17:28] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 82 ESP OK [12:17:28] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 82 ESP OK [12:17:28] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK [12:17:29] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [12:17:29] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 68 ESP OK [12:17:29] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK [12:17:29] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK [12:17:30] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 82 ESP OK [12:17:38] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 68 ESP OK [12:17:38] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [12:17:39] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 68 ESP OK [12:17:48] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 68 ESP OK [12:17:49] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 68 ESP OK [12:17:58] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 82 ESP OK [12:17:58] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 68 ESP OK [12:17:59] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 82 ESP OK [12:17:59] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 68 ESP OK [12:19:08] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 47.91 seconds [12:19:28] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:19:49] PROBLEM - Freshness of zerofetch successful run file on cp3037 is CRITICAL: CRITICAL: File /var/netmapper/.update-success is more 
than 86400 secs old! [12:20:08] (03PS1) 10Elukey: role::aqs: deploy different cassandra versions [puppet] - 10https://gerrit.wikimedia.org/r/440337 (https://phabricator.wikimedia.org/T197062) [12:20:30] those alerts there are due to cp3037 being manually rebooted by remote hands ^ [12:22:08] mutante: also, before you kill the cron, I guess you could test it on mwmaint1001 by running it by hand with a short max time to make sure it dispatches correctly [12:22:14] if you wanted to :) [12:22:55] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11515/" [puppet] - 10https://gerrit.wikimedia.org/r/440337 (https://phabricator.wikimedia.org/T197062) (owner: 10Elukey) [12:23:08] !log rolling restart of elasticsearch codfw for plugin upgrade - T194245 [12:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:14] T194245: Implement searching of 'depicts' on commons with the 'quantity' qualifier - https://phabricator.wikimedia.org/T194245 [12:24:09] RECOVERY - Freshness of zerofetch successful run file on cp3037 is OK: OK [12:27:23] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/440337 (https://phabricator.wikimedia.org/T197062) (owner: 10Elukey) [12:29:52] !log cp3037: run update-ocsp-all [12:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:42] addshore: yes, let's do that :) i will upload a patch to let us switch just wikidata cron without touching other crons [12:33:35] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/440293 (owner: 10Giuseppe Lavagetto) [12:33:50] (03CR) 10jerkins-bot: [V: 04-1] Fix gemspec warnings [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/440293 (owner: 10Giuseppe Lavagetto) [12:33:59] mutante: is mwmaint1001 already there with code on etc? 
[12:35:02] RECOVERY - DPKG on multatuli is OK: All packages OK [12:35:49] mutante: I just ran the dispatch script there for 10 seconds and everything ran just fine :) [12:44:12] (03CR) 10Volans: [V: 032 C: 032] "Tested live, all good." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440127 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:44:55] (03PS1) 10Muehlenhoff: Fix binary name in apt hook [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440339 [12:45:43] PROBLEM - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1968 bytes in 0.095 second response time [12:45:59] (03CR) 10Volans: [C: 031] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440339 (owner: 10Muehlenhoff) [12:46:11] (03CR) 10jerkins-bot: [V: 04-1] Fix binary name in apt hook [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440339 (owner: 10Muehlenhoff) [12:46:44] (03CR) 10BBlack: [C: 04-1] vcl: use synthetic warning for 1% of AES128-SHA pageviews (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [12:47:09] (03CR) 10Muehlenhoff: [V: 032 C: 032] Fix binary name in apt hook [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/440339 (owner: 10Muehlenhoff) [12:49:12] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [12:51:37] (03CR) 10Volans: [V: 032 C: 032] "@akosiaris, @vgutierrez feel free to comment post-merge, we're merging to get unblocked and proceed with the deployment, so that we catch " [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440155 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:52:05] (03CR) 10Volans: [V: 032 C: 032] Tests: remove spurious lines [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440167 (owner: 10Volans) [12:52:22] PROBLEM - 
Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:52:58] (03CR) 10Volans: [V: 032 C: 032] "@akosiaris, @vgutierrez feel free to comment post-merge, we're merging to get unblocked and proceed with the deployment, so that we catch " [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440168 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:53:00] (03PS4) 10Vgutierrez: vcl: use synthetic warning for 1% of AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) [12:53:21] (03CR) 10Vgutierrez: vcl: use synthetic warning for 1% of AES128-SHA pageviews (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [12:53:31] (03CR) 10Volans: [V: 032 C: 032] Use sha256 as checksum for the client self-update [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440275 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [12:54:08] (03CR) 10Volans: [V: 032 C: 032] "@akosiaris feel free to comment post-merge, we're merging to get unblocked and proceed with the deployment, so that we catch more issues." 
[software/debmonitor] - 10https://gerrit.wikimedia.org/r/440312 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [12:54:27] volans: ok, I 'll have a look and revert later on :P [12:54:43] akosiaris: welcome back online :) [12:54:53] !log Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds [12:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:57] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [12:55:14] (03CR) 10Volans: [V: 032 C: 032] Frontend: specify items shown per page [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440326 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [12:55:53] RECOVERY - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1957 bytes in 0.077 second response time [12:57:37] (03CR) 10BBlack: [C: 031] vcl: use synthetic warning for 1% of AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [12:58:58] (03Abandoned) 10Volans: Add initial Debianisation of debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [12:59:15] jouncebot, next [12:59:15] In 0 hour(s) and 0 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180614T1300) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180614T1300). 
[13:00:04] leszek_wmde and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] Here! [13:00:44] ditto [13:00:51] it's that time of day, again ;) [13:00:55] I can SWAT today [13:01:10] leszek_wmde, you are not a deployer, right? [13:01:16] zeljkof: no [13:01:31] o/ [13:01:36] leszek_wmde: ok, I'll let you know when the patch is at mwdebug [13:01:45] Urbanecm: please stand by, you are second [13:01:50] zeljkof: ok [13:01:50] Sure [13:04:09] leszek_wmde: um, your patch has -1 from jenkins-bot [13:04:16] ooooh [13:04:17] but I do not see any failed jobs :/ [13:04:27] ah, there is one [13:04:29] zeljkof: that's phan job failing on non-master patches [13:04:41] * addshore can vouch for the fact phan always fails on wikibase on the branch :) [13:04:52] zeljkof: known issue with wikibase (as in: i saw it happen bfore) [13:04:53] 10Operations, 10Traffic, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4253860 (10Verdy_p) Isn"t there a way for the wiki server to autodetect those browsers that are still using the legacy TLS implementation and add som... [13:05:09] leszek_wmde, addshore: could you please leave a comment at the patch that it's a known problem? [13:05:36] addshore: did we file a ticket for this alreayd? [13:05:40] zeljkof: can do [13:05:49] leszek_wmde: I don't rememmber D: [13:06:18] zeljkof: I'm happy to deploy this one if you would like! [13:07:20] addshore: I'm fine with deploying it, but I really don't like failing jobs, so just a bit of security (comment, known bug...) 
would help :) [13:07:28] yupp [13:07:37] youll have to +2V it and force merge it on gerrit [13:07:52] (03PS1) 10Volans: Updated src to v0.1.4 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/440340 (https://phabricator.wikimedia.org/T191299) [13:07:54] (03PS1) 10Volans: Built wheels for v0.1.4 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/440341 (https://phabricator.wikimedia.org/T191299) [13:08:59] (03CR) 10Volans: [V: 032 C: 032] Updated src to v0.1.4 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/440340 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:09:14] (03CR) 10Volans: [V: 032 C: 032] Built wheels for v0.1.4 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/440341 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:10:04] What's SWAT status? I'm seeing a little jenkins problem [13:10:32] Urbanecm: swat is in progress, what's the problem? [13:10:53] !log volans@deploy1001 Started deploy [debmonitor/deploy@476fd8b]: Release v0.1.4 [13:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:07] zeljkof, 440295 had -1 from jenkins, that's the problem I was seeing [13:11:13] !log volans@deploy1001 Finished deploy [debmonitor/deploy@476fd8b]: Release v0.1.4 (duration: 00m 19s) [13:11:16] I meant, what's happening with the change, as I didn't get it [13:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:12] Urbanecm: see our discussion, or refresh the patch in gerrit, looks like it's a known bug [13:12:51] zeljkof, right now, I see your CR+2 in gerrit and nothing on zuul [13:13:13] 10Operations, 10Traffic, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4282738 (10Verdy_p) Note that because of ULS, using "Wikipedia" instead of "Wikimedia" is still accurate: the secure logon will be made on other wiki... 
[13:13:15] Urbanecm: I see 440295 in gate-and-submit-swat in zuul [13:13:40] Oh, it's in gate-and-submit-swat, not in gate-and-submit. My mistake, didn't know about the other queue [13:13:53] 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T197237#4282750 (10MSantos) [13:14:19] Urbanecm: namespaceDupes needs to run for T197033? [13:14:19] T197033: ProofreadPage namespaces are wrong on pms.source - https://phabricator.wikimedia.org/T197033 [13:14:24] Urbanecm: nothing else? [13:14:32] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 354.02 seconds [13:14:34] Patch is merged already, so only the script [13:14:43] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.75 seconds [13:14:44] Urbanecm: ok, running the script [13:14:46] thx [13:15:26] 10Operations, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4282762 (10Urbanecm) [13:15:33] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.22 seconds [13:15:41] 10Operations, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4282750 (10Urbanecm) I've renamed the task to better express its sense. Feel free to edit @MSantos anyway. [13:15:52] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:16:08] zeljkof, can you then post the output to T197033 please? [13:16:20] Urbanecm: done https://phabricator.wikimedia.org/T197033#4282765 [13:16:22] thx [13:16:29] but looks like there is nothing to fix :/ [13:16:48] Then there's other issue I'm not seeing. 
Will investigate after swat [13:16:49] But thank you [13:17:18] 10Operations, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4282767 (10MSantos) Thanks @Urbanecm [13:18:06] (03PS1) 10Ema: reload-vcl: add --separate-vcls [puppet] - 10https://gerrit.wikimedia.org/r/440342 (https://phabricator.wikimedia.org/T164609) [13:18:32] PROBLEM - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1962 bytes in 0.075 second response time [13:20:12] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438283 (https://phabricator.wikimedia.org/T196719) (owner: 10Urbanecm) [13:20:36] (03PS2) 10Ema: reload-vcl: add --separate-vcls [puppet] - 10https://gerrit.wikimedia.org/r/440342 (https://phabricator.wikimedia.org/T164609) [13:22:16] (03Merged) 10jenkins-bot: Set a few of namespace aliases on ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438283 (https://phabricator.wikimedia.org/T196719) (owner: 10Urbanecm) [13:22:37] 10Operations, 10Patch-For-Review: operations/software repo: flake8 check - https://phabricator.wikimedia.org/T178877#4282773 (10Marostegui) [13:22:45] (03CR) 10jenkins-bot: Set a few of namespace aliases on ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438283 (https://phabricator.wikimedia.org/T196719) (owner: 10Urbanecm) [13:25:47] leszek_wmde, addshore: 440295 is at mwdebug1002 for testing, let me know if I can deploy it [13:25:57] zeljkof: thanks! [13:26:30] addshore: Im editing Q33 on test, or are you? [13:26:37] ill let you :) [13:26:39] addshore: awesome! 
sorry for the delay, we were out for food [13:26:53] was waiting until after the deploy window [13:27:05] Urbanecm: 438283 is at mwdebug [13:27:15] ack [13:27:28] (03PS3) 10Zfilipin: Allow bcts on private&fishbowl wikis advanced privilege manipulation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [13:27:35] 10Operations, 10ops-eqiad: Non-redundant power supply on helium - https://phabricator.wikimedia.org/T186808#4282783 (10Cmjohnson) [13:28:00] working, please deploy [13:29:07] (03PS8) 10Dzahn: mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [13:29:58] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: run wikidata maint jobs on old and new server [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [13:31:23] (03PS9) 10Dzahn: mw-maintenance: switch only wikidata maint jobs to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [13:32:07] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: switch only wikidata maint jobs to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [13:33:49] addshore: so the sitelink part seems to work, but I am seeing weird behaviour when using #statemetns parser function on test.wikipedia [13:33:56] addshore: I assume it is no good [13:33:58] ? [13:34:19] Urbanecm: deploying [13:34:23] ack [13:34:56] leszek_wmde: hmm [13:34:58] not sure :/ [13:35:09] leszek_wmde: weird behaviour ? 
[13:35:24] internal server error when lexeme property used [13:35:26] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:438283|Set a few of namespace aliases on ruwikisource (T196719)]] (duration: 00m 59s) [13:35:28] Urbanecm: 438283 deployed [13:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:31] T196719: Adding aliases in ru.wikisource - https://phabricator.wikimedia.org/T196719 [13:36:43] ac [13:36:46] k [13:36:57] addshore: so when I do {{#statements:P74993|from=Q33}} on any test.wikipedia page it does not seem to work. I.e. I guess rolling back? [13:37:10] leszek_wmde: yeh i guess :/ [13:37:18] addshore: interestingly the original bug the ticket was mentioning is fixed :) [13:37:20] mmmmmff [13:37:38] zeljkof: we're broken. could you please roll back? [13:37:45] yay reverts [13:37:49] Urbanecm: for 440000, looks like MarcoAurelio would like more discussion before it is deployed? [13:38:07] leszek_wmde, addshore: ok, reverting [13:38:26] (03PS10) 10Dzahn: mw-maintenance: switch only wikidata maint jobs to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [13:38:27] I don't think futher discussions are necessary, this is already added to new private/fishbowl wikis _by default_ [13:38:32] (with no explicit request) [13:40:45] Urbanecm: I would still like a +1 from anybody before deploying it, I don't feel comfortable, please understand that I don't really understand the details :) [13:41:25] 10Operations, 10Traffic, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4282813 (10Aklapper) @Verdy_p: IMO your question is already answered in the task description. Apart from that it seems unclear which actual problems... 
[13:41:28] (03PS11) 10Dzahn: mw-maintenance: switch only wikidata maint jobs to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [13:43:44] Ok then zeljkof [13:43:46] But other patches are depending on this one [13:45:54] Urbanecm: I would really really like some discussion about it either in gerrit or phab, before I deploy it [13:46:05] ok [13:46:09] zeljkof: thank you for your assistance. please tell me when I could be dismissed [13:46:16] I'm not familiar with the feature, so I hesitate to break stuff :/ [13:46:18] !log zfilipin@deploy1001 Synchronized php-1.32.0-wmf.8/extensions/Wikibase: SWAT: [[gerrit:440343| Revert "Statement transclusion: when entity of unknown type in statement, display ID as string" (T195615)]] (duration: 01m 21s) [13:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:22] T195615: handle use of statements linking to Lexemes (and Forms?) more gracefully on client - https://phabricator.wikimedia.org/T195615 [13:46:22] (03CR) 10Dzahn: [C: 031] "https://puppet-compiler.wmflabs.org/compiler03/11517/" [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [13:46:37] leszek_wmde: I have just deployed the revert, you are free like a bird! :D [13:47:22] zeljkof: thank you [13:47:32] Urbanecm: thanks for deploying with #releng then! 
(I assume none of the remaining patches could be deployed, if they depend on 440000) [13:47:46] !log EU SWAT finished [13:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:54] You're right [13:47:59] Thank you for your help [13:49:14] (03PS2) 10Dzahn: Make Phabricator footer links use Special:MyLanguage [puppet] - 10https://gerrit.wikimedia.org/r/439482 (https://phabricator.wikimedia.org/T196836) (owner: 10Aklapper) [13:49:36] (03CR) 10Dzahn: [C: 032] Make Phabricator footer links use Special:MyLanguage [puppet] - 10https://gerrit.wikimedia.org/r/439482 (https://phabricator.wikimedia.org/T196836) (owner: 10Aklapper) [13:51:10] 10Operations, 10Traffic, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4282826 (10BBlack) >>! In T196371#4282728, @Verdy_p wrote: > Isn"t there a way for the wiki server to autodetect those browsers that are still using... [13:51:23] 10Operations, 10Traffic, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4282827 (10Verdy_p) Not able to even read the wiki in an enforced incognito mode (removing all private session keys, disabling some scripts, just ren... 
[13:53:22] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 59834 MB (12% inode=99%) [13:53:53] RECOVERY - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1977 bytes in 0.078 second response time [14:04:06] hello [14:04:27] bad kids [14:09:32] RECOVERY - Disk space on elastic1018 is OK: DISK OK [14:10:01] Amir1: this is the check URL https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&format=json&siprop=statistics [14:10:53] Amir1: --ereg '"median":[^}]*"lag":([1-2]?[0-9]?[0-9]|600),' [14:10:59] yes, it is median [14:15:12] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4282898 (10Cmjohnson) @faidon, I had previously tried to connect to 10G and did not have any luck so I ended up connecting the ethernet p... [14:15:35] Krinkle: I have now deleted all autopatrol logs everywhere now (including 2008-2011 logs) [14:17:57] (03Abandoned) 10Ladsgroup: mediawiki: Stop Wikidata dispatching [puppet] - 10https://gerrit.wikimedia.org/r/440142 (https://phabricator.wikimedia.org/T192092) (owner: 10Ladsgroup) [14:18:11] (03CR) 10Hashar: [C: 031] "Looks good to me for the technical side." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [14:19:21] (03CR) 10Hashar: [C: 031] "Oh and Urbanecm, I am fine deployed this change without you being around. 
I should be able to verify the user rights are sane on private/f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [14:23:01] (03CR) 10Ladsgroup: [C: 031] mw-maintenance: switch only wikidata maint jobs to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [14:23:50] (03CR) 10Hoo man: [C: 031] mw-maintenance: switch only wikidata maint jobs to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [14:28:56] (03PS1) 10Dzahn: wikidata/icinga: turn wikidata dispatch check into a WARN [puppet] - 10https://gerrit.wikimedia.org/r/440347 [14:29:40] (03CR) 10Dzahn: [C: 032] "also: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440347/" [puppet] - 10https://gerrit.wikimedia.org/r/439528 (https://phabricator.wikimedia.org/T194602) (owner: 10Addshore) [14:29:52] PROBLEM - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1970 bytes in 0.094 second response time [14:30:50] (03CR) 10Dzahn: [C: 032] wikidata/icinga: turn wikidata dispatch check into a WARN [puppet] - 10https://gerrit.wikimedia.org/r/440347 (owner: 10Dzahn) [14:31:16] (03PS2) 10Dzahn: wikidata/icinga: turn wikidata dispatch check into a WARN [puppet] - 10https://gerrit.wikimedia.org/r/440347 [14:32:23] (03CR) 10Dzahn: [V: 032 C: 032] wikidata/icinga: turn wikidata dispatch check into a WARN [puppet] - 10https://gerrit.wikimedia.org/r/440347 (owner: 10Dzahn) [14:32:48] !log (slow) initial rollout of debmonitor-client [14:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:42] (03PS12) 10Dzahn: mw-maintenance: switch only wikidata maint jobs to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) [14:35:37] ACKNOWLEDGEMENT 
- wikidata.org dispatch lag is higher than 600s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1942 bytes in 0.071 second response time daniel_zahn . [14:36:52] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 355.89 seconds [14:37:23] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 372.75 seconds [14:37:47] anomie ^ how is that script going? [14:38:06] do you have any ETA / progress you can share? [14:38:40] (03CR) 10Dzahn: [C: 032] mw-maintenance: switch only wikidata maint jobs to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/440328 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [14:40:02] RECOVERY - wikidata.org dispatch lag is higher than 600s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.071 second response time [14:40:09] !log moving wikidata query dispatcher from terbium to mwmaint1001 - scheduled downtime - check turned into a WARN - disabling puppet on mwmaint1001, removing crons on terbium, waiting a couple minutes for them to finish, re-enabling puppet on mwmaint1001 (T192092) [14:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:14] T192092: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092 [14:43:09] mutante: woo! [14:43:26] switching the wikidata dispatcher jobs now to new host. also the check shouldn't alert anymore and is in scheduled downtime. it will take 12 minutes for existing jobs to finish on terbium. then mwmaint1001 will be enabled [14:44:17] * addshore would be find with the new cron being enabled already ;) [14:44:17] It should be possible to just the crons on the new host straight ahead [14:44:20] *fine [14:44:50] this is known to be fine [14:45:57] paranoia that it could overload db? 
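Note on the dispatch lag check pattern quoted at 14:10:53: the sketch below runs that same expression against a few lag values to show which ones it accepts. The pattern is copied from the log; the response body is a hypothetical, heavily trimmed stand-in for the siteinfo statistics output, and the helper names `check_ok` and `sample_body` are invented for illustration. Python's `re` module is used here in place of the monitoring plugin's own regex engine, which should behave the same for this particular expression.

```python
import re

# Pattern copied verbatim from the 14:10:53 message.
PATTERN = re.compile(r'"median":[^}]*"lag":([1-2]?[0-9]?[0-9]|600),')


def sample_body(lag: int) -> str:
    """Build a hypothetical, trimmed API response containing a median lag."""
    return '{"dispatch":{"median":{"pending":5,"lag":%d,"position":10}}}' % lag


def check_ok(body: str) -> bool:
    """Return True when the pattern is found, i.e. the check would not report 'pattern not found'."""
    return PATTERN.search(body) is not None


if __name__ == "__main__":
    for lag in (0, 299, 300, 599, 600, 601):
        print(lag, check_ok(sample_body(lag)))
```

Read this way, the expression as quoted accepts a median lag of 0 to 299 seconds and exactly 600 seconds, and anything else produces "pattern not found". Whether that range is what the check was meant to cover is not stated in the log.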
[14:46:06] is what has been mentioned earlier [14:47:18] Shouldn't [14:47:23] new isntances only start every 3m [14:47:38] so we will not exceed the number of instances that would run on terbium anyway [14:48:00] This would have only been a problem if both servers keep spawning new instances ever 3m [14:50:00] mutante: i was against having the 2 crons enabled at the same time [14:50:11] but turning one off and the other on at the same time i see no problem with :) [14:51:29] ok and it's down to 2 dispathers. enabling puppet on mwmaint1001 [14:51:39] sonuds good! [14:52:20] * addshore switched the load graph on https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch to point at mwmaint1001 [14:52:56] crons have been added on new server [14:53:00] wheeee [14:53:24] we should see "script starts" in the top left start firing again soon https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch-script [14:53:33] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4282963 (10chasemp) We met today to sync up on moving the remaining lab* servers. Hopefully these days/times al work for @cmjohnson (I added him to the calend... [14:54:27] mutante: looks like the first cron just fired and is dispatching :) [14:54:32] * addshore would consider this a success [14:54:38] :) very nice [14:54:45] we did _not_ copy that one logfile yet [14:54:59] but looks like we didnt have to [14:55:06] copy the old log file to the new server? mhmm, porbbaly no point [14:55:39] No point IMO [14:55:53] We only look into this every once in a while [14:55:55] addshore: It's needed for other cronjob that looks into it [14:55:58] probably just me, acutally [14:56:06] other cronjob? 
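Note on the handover argument made between 14:47 and 14:50: a minimal sketch of why stopping the cron on one host and starting it on the other at the same moment should not raise the number of running dispatchers, while leaving both crons enabled would. It assumes one new instance every 3 minutes (from the log) and a fixed 12-minute lifetime per instance (taken from the "12 minutes for existing jobs to finish" remark); the fixed lifetime, the switch time and all function names are illustrative assumptions, not the behaviour of the real dispatch script.

```python
INTERVAL = 3    # minutes between cron starts (from the log)
LIFETIME = 12   # assumed minutes each instance keeps running
SWITCH = 60     # hypothetical minute at which the handover happens
HORIZON = 120   # minutes simulated


def running_at(starts, lifetime, t):
    """Count instances from `starts` that are still alive at minute t."""
    return sum(1 for s in starts if s <= t < s + lifetime)


def peak(starts, lifetime, horizon):
    """Highest number of simultaneously running instances in the window."""
    return max(running_at(starts, lifetime, t) for t in range(horizon))


# Old host stops spawning at SWITCH, new host starts spawning at SWITCH.
old_until_switch = list(range(0, SWITCH, INTERVAL))
new_from_switch = list(range(SWITCH, HORIZON, INTERVAL))
# Alternative scenario: the old host is never switched off.
old_never_stopped = list(range(0, HORIZON, INTERVAL))

handover_peak = peak(old_until_switch + new_from_switch, LIFETIME, HORIZON)
overlap_peak = peak(old_never_stopped + new_from_switch, LIFETIME, HORIZON)

if __name__ == "__main__":
    print("clean handover peak:", handover_peak)
    print("both crons enabled peak:", overlap_peak)
```

Under these assumptions the clean handover peaks at 4 concurrent instances, the same as a single host, while keeping both crons enabled peaks at 8.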
[14:56:08] Amir1: Not for dispatching [14:56:26] for wikidata-clearTermSqlIndexSearchFields [14:56:32] pruneChanges.php is also switched [14:56:33] Yes [14:56:45] but dispatchChanges/ pruneChanges don't need this [14:57:03] yes, the clear term sql index needs it [14:57:34] 631K May 25 05:28 clearTermSqlIndexSearchFields.log-20180525 [14:57:50] nothing is newer than May 25 [14:58:12] i will still copy that one now ^ [14:58:28] "needed in the future" [14:59:40] tail -f dispatchChanges-wikidatawiki.log [14:59:46] sees it working happily :) [14:59:48] dipsatch lag is improving again [14:59:57] * hoo goes for food for a bit [15:00:07] 10Operations, 10fundraising-tech-ops, 10netops: adjust NAT mapping for frdata.wikimedia.org - https://phabricator.wikimedia.org/T196656#4282970 (10ayounsi) NAT change pushed. [15:02:34] it has some "failed to grab dispatch lock" for some.. but not all [15:03:28] 10Operations, 10Analytics, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Clean up cpjobqueue metrics - https://phabricator.wikimedia.org/T196067#4282976 (10fgiunchedi) [15:03:32] RECOVERY - Host labvirt1019 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [15:03:44] 10Operations, 10Traffic, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4282978 (10BBlack) We've got some overlapping timelines on these long-form posts :) I assume most of the most-recent one above is in the context of... [15:04:18] normal behaviour also on terbium. ack [15:04:24] PROBLEM - nova-compute proc minimum on labvirt1019 is CRITICAL: Return code of 255 is out of bounds [15:04:37] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4282979 (10jcrespo) To clarify- databases that are depooled do not need to stop replication- if replication goes down it tries to infinitely retry connecting w... 
[15:04:54] PROBLEM - ensure kvm processes are running on labvirt1019 is CRITICAL: Return code of 255 is out of bounds [15:05:08] (03CR) 10Ema: "Let's add a test case too: https://phabricator.wikimedia.org/P7258" [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [15:05:24] jynus: It has 26 wikis to go, including the one it's on now. Most are small, a few like ukwiki and zhwiki are relatively large. Rough estimate is around 16 hours until it's done. [15:05:53] PROBLEM - DPKG on labvirt1019 is CRITICAL: Return code of 255 is out of bounds [15:06:03] PROBLEM - kvm ssl cert on labvirt1019 is CRITICAL: Return code of 255 is out of bounds [15:06:04] PROBLEM - dhclient process on labvirt1019 is CRITICAL: Return code of 255 is out of bounds [15:06:25] anomie: that sounds to me like 90% of the work done? [15:10:10] mutante: https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1&from=now-1h&to=now [15:12:15] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4282998 (10pmiazga) @akosiaris - I did couple checks, looks like there are couple errors to investigate and fix (example: `trying to 'length'... [15:13:21] jynus: Looks like a bit over 96.6% done (by the same numbers used for the 16-hour estimate) with the whole run over all three groups that started May 23. 94% done with the current run over group2 wikis that started last Monday. The estimate is rough because it's just looking at the max el_id on each wiki, rather than how many rows actually need backfilling. 
[15:14:41] 90 works too, I didn't mean to make you work :-) [15:15:41] * anomie got nerd sniped ;) [15:18:25] (03CR) 10Volans: "Two nits inline (I'll leave the VCL part to the traffic team)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440342 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [15:23:32] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4283034 (10faidon) OK, I managed to get this server to boot from its 10G interfaces. The issue was fairly straightforward to resolve ("net... [15:23:55] 10Operations, 10Analytics, 10Jupyter-Hub, 10SRE-Access-Requests: JupyterHub access for meps not working (was: Requesting access to analytics servers for mepps) - https://phabricator.wikimedia.org/T192472#4283037 (10mepps) That was it! I'm in. Thank you @Dzahn and @Ottomata!! [15:24:33] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching): Transition citoid to using Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242#4283038 (10Mvolz) [15:24:48] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching): Transition citoid to using Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242#4283049 (10Mvolz) 05Open>03stalled p:05Triage>03High [15:25:01] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242#4283038 (10Mvolz) [15:27:02] 10Operations, 10Cloud-Services: 10G ports seem not to work on new HP hardware - https://phabricator.wikimedia.org/T197169#4283053 (10faidon) So for at least labvirt1019 it was indeed about PXE not working (the card worked under Linux) and that was due to a BIOS misconfiguration (the "network boot" option for t... 
[15:37:26] (03PS2) 10Ottomata: Use Kafka main-eqiad for EventStreams service [puppet] - 10https://gerrit.wikimedia.org/r/439996 (https://phabricator.wikimedia.org/T185225) [15:38:40] (03CR) 10Ottomata: [C: 032] Use Kafka main-eqiad for EventStreams service [puppet] - 10https://gerrit.wikimedia.org/r/439996 (https://phabricator.wikimedia.org/T185225) (owner: 10Ottomata) [15:39:32] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242#4283084 (10Mvolz) [15:39:39] !log switching EventStreams service to be backed by main-eqiad - T185225 [15:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:44] T185225: Move EventStreams to main Kafka clusters - https://phabricator.wikimedia.org/T185225 [15:40:31] 10Operations, 10Citoid, 10Code-Stewardship-Reviews, 10VisualEditor, 10Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4283092 (10Mvolz) I've created a separate task to track transition from translation-server to translation-server-v2 (a... 
[15:44:01] (03PS15) 10Ottomata: SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [15:44:03] (03PS1) 10Ottomata: Remove unneeded broker.version.fallback for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/440354 (https://phabricator.wikimedia.org/T185225) [15:44:32] (03Abandoned) 10Ottomata: Remove unneeded broker.version.fallback for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/440354 (https://phabricator.wikimedia.org/T185225) (owner: 10Ottomata) [15:45:05] (03PS1) 10Ottomata: Remove unneeded broker.version.fallback for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/440355 (https://phabricator.wikimedia.org/T185225) [15:45:52] (03CR) 10Ottomata: [V: 032 C: 032] Remove unneeded broker.version.fallback for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/440355 (https://phabricator.wikimedia.org/T185225) (owner: 10Ottomata) [15:46:52] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received [15:47:52] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [15:52:33] 10Operations, 10Mail, 10Wikimedia-Logstash: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173#4280956 (10faidon) Our email logs can be pretty sensitive, especially since they include our corporate emails passing through (senders, recipients, timestamps etc.). So, I'm a bit concerned about both... [16:00:04] godog, moritzm, and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180614T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:01:49] (03PS1) 10Papaul: DHCP: Add MAC address and netboot entries for lvs2009 and lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/440360 (https://phabricator.wikimedia.org/T196560) [16:03:38] (03PS5) 10Vgutierrez: vcl: use synthetic warning for 1% of AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) [16:07:26] 10Operations, 10fundraising-tech-ops, 10netops: adjust NAT mapping for frdata.wikimedia.org - https://phabricator.wikimedia.org/T196656#4283202 (10cwdent) 05Open>03Resolved Thanks @ayounsi Site looks good, and slander is working again after kicking syslog a few places [16:07:29] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#4283204 (10cwdent) [16:11:47] 10Operations, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4282750 (10dr0ptp4kt) Approved. [16:12:11] 10Operations, 10Analytics, 10Jupyter-Hub, 10SRE-Access-Requests: JupyterHub access for meps not working (was: Requesting access to analytics servers for mepps) - https://phabricator.wikimedia.org/T192472#4283214 (10herron) 05Open>03Resolved a:03herron [16:15:08] (03PS1) 10Subramanya Sastry: Enable RemexHtml on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440362 (https://phabricator.wikimedia.org/T195263) [16:18:32] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [16:20:23] 10Operations, 10ops-esams, 10Traffic: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T196974#4274802 (10ema) The host and its management interface are back online. It seems like we're looking at a thermal issue, here are kernel logs at the time of the crash: ``` Jun 12 06:26:51 cp3037... [16:21:43] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:25:22] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404) [16:26:22] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [16:30:30] (03PS6) 10Vgutierrez: vcl: use synthetic warning for 1% of AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) [16:30:37] (03PS3) 10Ema: reload-vcl: add --separate-vcls [puppet] - 10https://gerrit.wikimedia.org/r/440342 (https://phabricator.wikimedia.org/T164609) [16:31:03] (03PS1) 10Volans: Puppet agent: fix redirect to syslogs [puppet] - 10https://gerrit.wikimedia.org/r/440365 (https://phabricator.wikimedia.org/T191300) [16:31:47] (03CR) 10Ema: reload-vcl: add --separate-vcls (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440342 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [16:36:04] (03PS7) 10Vgutierrez: vcl: use synthetic warning for 1% of AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) [16:36:12] (03CR) 10Ema: [C: 031] vcl: use synthetic warning for 1% of AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [16:37:33] (03PS1) 10Andrew Bogott: labtestn: use proper labtestn db password from hiera [puppet] - 10https://gerrit.wikimedia.org/r/440366 [16:41:14] !log disable puppet on cache nodes before merging gerrit/440114 - T192555 [16:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:19] T192555: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 [16:43:19] (03PS8) 10Vgutierrez: vcl: use synthetic warning for 1% of AES128-SHA pageviews [puppet] - 
10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) [16:43:33] (03CR) 10Vgutierrez: [C: 032] vcl: use synthetic warning for 1% of AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/440114 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [16:44:54] (03PS1) 10Volans: scap: add service name to restart on deploy [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/440368 (https://phabricator.wikimedia.org/T191299) [16:53:56] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4283368 (10Cmjohnson) The dates work for me..I accepted the calendar invites [16:54:55] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4283377 (10bd808) [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: That opportune time is upon us again. Time for a Services – Graphoid / Parsoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180614T1700). 
[17:03:23] (03CR) 10Jforrester: [C: 031] "Oops, knew we'd forget something…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440283 (owner: 10Legoktm) [17:05:32] (03PS2) 10Reedy: Bump ExtensionDistributor default to REL1_31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440283 (owner: 10Legoktm) [17:05:38] (03CR) 10Reedy: [C: 031] "Stupid portals" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440283 (owner: 10Legoktm) [17:06:45] 10Operations, 10ops-eqiad, 10DC-Ops: Power supply issue on maps1002 - https://phabricator.wikimedia.org/T196897#4283416 (10Cmjohnson) 05Open>03Resolved Replaced the power supply sent the part back via UPS 1Z W09 48Y 90 8422 5343 [17:07:29] (03CR) 10Jforrester: [C: 031] Enable RemexHtml on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440362 (https://phabricator.wikimedia.org/T195263) (owner: 10Subramanya Sastry) [17:08:17] !log applying ACLs to Kafka main-codfw and main-eqiad - T196081 [17:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:21] T196081: Enable TLS and authorization for cross DC MirrorMaker - https://phabricator.wikimedia.org/T196081 [17:10:16] !log bouncing kafka broker on kafka2003 to make sure ACLs are okk [17:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:47] 10Operations, 10Discovery, 10Discovery-Search, 10Wikidata, and 4 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4283427 (10debt) 05Open>03Resolved [17:14:45] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Easy: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#4283447 (10debt) [17:18:32] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [17:21:21] (03PS10) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [17:21:27] (03PS1) 
10Vgutierrez: vcl: fix html layout on browsersec [puppet] - 10https://gerrit.wikimedia.org/r/440372 (https://phabricator.wikimedia.org/T192555) [17:21:52] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:22:25] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo) [17:23:43] (03CR) 10Vgutierrez: [C: 032] vcl: fix html layout on browsersec [puppet] - 10https://gerrit.wikimedia.org/r/440372 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [17:23:50] (03PS2) 10Vgutierrez: vcl: fix html layout on browsersec [puppet] - 10https://gerrit.wikimedia.org/r/440372 (https://phabricator.wikimedia.org/T192555) [17:25:21] (03PS1) 10Ottomata: Properly name dummy kafka_main-eqiad_broker certificate files [labs/private] - 10https://gerrit.wikimedia.org/r/440373 [17:26:10] (03CR) 10Ottomata: [V: 032 C: 032] Properly name dummy kafka_main-eqiad_broker certificate files [labs/private] - 10https://gerrit.wikimedia.org/r/440373 (owner: 10Ottomata) [17:28:16] (03PS11) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [17:28:58] (03PS1) 10Vgutierrez: vcl: disable show_diff for browsersec.inc.vcl [puppet] - 10https://gerrit.wikimedia.org/r/440375 (https://phabricator.wikimedia.org/T192555) [17:29:20] (03CR) 10jerkins-bot: [V: 04-1] vcl: disable show_diff for browsersec.inc.vcl [puppet] - 10https://gerrit.wikimedia.org/r/440375 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [17:29:59] (03PS2) 10Vgutierrez: vcl: disable show_diff for browsersec.inc.vcl [puppet] - 10https://gerrit.wikimedia.org/r/440375 (https://phabricator.wikimedia.org/T192555) [17:30:19] (03CR) 10jerkins-bot: [V: 04-1] vcl: disable show_diff for browsersec.inc.vcl [puppet] - 10https://gerrit.wikimedia.org/r/440375 
(https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [17:30:52] (03PS3) 10Vgutierrez: vcl: disable show_diff for browsersec.inc.vcl [puppet] - 10https://gerrit.wikimedia.org/r/440375 (https://phabricator.wikimedia.org/T192555) [17:31:18] (03CR) 10jerkins-bot: [V: 04-1] vcl: disable show_diff for browsersec.inc.vcl [puppet] - 10https://gerrit.wikimedia.org/r/440375 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [17:31:37] (03PS16) 10Ottomata: SSL for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) [17:32:05] (03CR) 10Ottomata: [C: 032] "No op in prod" [puppet] - 10https://gerrit.wikimedia.org/r/440162 (https://phabricator.wikimedia.org/T196081) (owner: 10Ottomata) [17:32:27] (03PS4) 10Vgutierrez: vcl: disable show_diff for browsersec.inc.vcl [puppet] - 10https://gerrit.wikimedia.org/r/440375 (https://phabricator.wikimedia.org/T192555) [17:34:44] (03CR) 10Vgutierrez: [C: 032] vcl: disable show_diff for browsersec.inc.vcl [puppet] - 10https://gerrit.wikimedia.org/r/440375 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [17:34:56] !log rolling restart of elasticsearch codfw completed - T194245 [17:34:59] (03PS5) 10Vgutierrez: vcl: disable show_diff for browsersec.inc.vcl [puppet] - 10https://gerrit.wikimedia.org/r/440375 (https://phabricator.wikimedia.org/T192555) [17:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:00] T194245: Implement searching of 'depicts' on commons with the 'quantity' qualifier - https://phabricator.wikimedia.org/T194245 [17:35:13] < 5h for an elasticsearch cluster restart, I think that's a new record... 
[17:35:42] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet operation_type={container_status,create_container,image_status,podsandbox_status,remove_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:35:59] gehel: !!! [17:36:31] (03PS1) 10Ottomata: Enable TLS MirrorMaker consumer for main-eqiad -> main-codfw MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440378 (https://phabricator.wikimedia.org/T196081) [17:36:43] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:41:42] (03PS1) 10Ottomata: Add kafka_main-codfw_broker dummy certs for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/440379 [17:41:54] (03CR) 10Ottomata: [V: 032 C: 032] Add kafka_main-codfw_broker dummy certs for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/440379 (owner: 10Ottomata) [17:45:04] (03CR) 10Ottomata: [C: 032] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/440378 (https://phabricator.wikimedia.org/T196081) (owner: 10Ottomata) [17:51:04] (03PS1) 10Vgutierrez: browsersec: Fix italian translation [puppet] - 10https://gerrit.wikimedia.org/r/440380 (https://phabricator.wikimedia.org/T192555) [17:52:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties [17:53:25] (03CR) 10Ema: [C: 031] browsersec: Fix italian translation [puppet] - 10https://gerrit.wikimedia.org/r/440380 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [17:53:41] (03CR) 10Vgutierrez: [C: 032] browsersec: Fix italian translation [puppet] - 10https://gerrit.wikimedia.org/r/440380 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [18:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180614T1800). [18:00:05] subbu and hauskatze: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:08] o/ [18:00:10] (03PS1) 10Ottomata: Move kafka_mirror_maker ssl path to /etc/kafka/mirror/ssl [puppet] - 10https://gerrit.wikimedia.org/r/440381 (https://phabricator.wikimedia.org/T196081) [18:00:14] o/ [18:00:34] o/ I can SWAT. [18:01:52] (03CR) 10Niharika29: [C: 032] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/440362 (https://phabricator.wikimedia.org/T195263) (owner: 10Subramanya Sastry) [18:02:26] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11522/kafka2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/440381 (https://phabricator.wikimedia.org/T196081) (owner: 10Ottomata) [18:02:29] (03PS2) 10Ottomata: Move kafka_mirror_maker ssl path to /etc/kafka/mirror/ssl [puppet] - 10https://gerrit.wikimedia.org/r/440381 (https://phabricator.wikimedia.org/T196081) [18:02:31] (03CR) 10Ottomata: [V: 032 C: 032] Move kafka_mirror_maker ssl path to /etc/kafka/mirror/ssl [puppet] - 10https://gerrit.wikimedia.org/r/440381 (https://phabricator.wikimedia.org/T196081) (owner: 10Ottomata) [18:02:44] Niharika: Can I ad https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/440283 to SWAT? [18:03:22] (03Merged) 10jenkins-bot: Enable RemexHtml on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440362 (https://phabricator.wikimedia.org/T195263) (owner: 10Subramanya Sastry) [18:03:33] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka2001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties [18:03:38] (03CR) 10jenkins-bot: Enable RemexHtml on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440362 (https://phabricator.wikimedia.org/T195263) (owner: 10Subramanya Sastry) [18:03:46] James_F: Sure. [18:03:54] Niharika: Thanks! [18:05:07] subbu: Your change is on mwdebug1002. [18:05:26] (03PS2) 10Niharika29: beta: declare beta sr.wikipedia and beta crh.wikipedia to langlist-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440289 (owner: 10MarcoAurelio) [18:05:32] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440289 (owner: 10MarcoAurelio) [18:06:37] ok .. will verify. 
[18:07:06] (03PS2) 10Dzahn: DHCP: Add MAC address and netboot entries for backup2001 [puppet] - 10https://gerrit.wikimedia.org/r/439830 (https://phabricator.wikimedia.org/T196477) (owner: 10Papaul) [18:07:09] (03Merged) 10jenkins-bot: beta: declare beta sr.wikipedia and beta crh.wikipedia to langlist-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440289 (owner: 10MarcoAurelio) [18:07:26] Niharika, lgtm and good to go as long as you don't see errors in logs. [18:07:57] (03CR) 10jenkins-bot: beta: declare beta sr.wikipedia and beta crh.wikipedia to langlist-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440289 (owner: 10MarcoAurelio) [18:08:01] Fatalmonitor is pretty silent. [18:08:16] (03PS1) 10Ottomata: Enable SSL for main-codfw -> main-eqiad Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440384 (https://phabricator.wikimedia.org/T196081) [18:08:28] Hauskatze: Your change is on mwdebug1002 as well. [18:08:33] testing, if any [18:08:35] (03CR) 10Bearloga: statistics::discovery: re-enable cron job (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [18:08:47] (03CR) 10Ottomata: [V: 032 C: 032] Enable SSL for main-codfw -> main-eqiad Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/440384 (https://phabricator.wikimedia.org/T196081) (owner: 10Ottomata) [18:09:27] Niharika: I don't think I can test that on beta [18:09:44] Hauskatze: Alright. I'll sync it out. [18:09:46] mwdebug1002 doesn't display any differences on SiteMatrix [18:09:54] (03PS3) 10Niharika29: Bump ExtensionDistributor default to REL1_31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440283 (owner: 10Legoktm) [18:10:02] (03CR) 10Niharika29: [C: 032] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/440283 (owner: 10Legoktm) [18:10:03] yep, we can revert if something goes wrong, it's beta after all :P [18:10:17] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable RemexHtml on a few more wikis - T195263 (duration: 01m 00s) [18:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:21] T195263: Do another round of Tidy replacement on May 30th and June 14th (last deploys before final switch) - https://phabricator.wikimedia.org/T195263 [18:10:28] (03CR) 10Dzahn: [C: 032] DHCP: Add MAC address and netboot entries for backup2001 [puppet] - 10https://gerrit.wikimedia.org/r/439830 (https://phabricator.wikimedia.org/T196477) (owner: 10Papaul) [18:10:29] subbu: Your change is out^ [18:10:30] (03PS3) 10Dzahn: DHCP: Add MAC address and netboot entries for backup2001 [puppet] - 10https://gerrit.wikimedia.org/r/439830 (https://phabricator.wikimedia.org/T196477) (owner: 10Papaul) [18:11:09] Niharika, ty. [18:11:26] (03Merged) 10jenkins-bot: Bump ExtensionDistributor default to REL1_31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440283 (owner: 10Legoktm) [18:12:22] (03CR) 10jenkins-bot: Bump ExtensionDistributor default to REL1_31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440283 (owner: 10Legoktm) [18:13:28] (03PS2) 10Dzahn: DNS: Add mgmt DNS entries for dns200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/439973 (https://phabricator.wikimedia.org/T196493) (owner: 10Papaul) [18:13:30] Niharika: it's working now [18:13:42] guess scap has rebuild some cdbs etc [18:14:11] !log niharika29@deploy1001 Synchronized langlist-labs: beta: declare beta sr.wikipedia and beta crh.wikipedia to langlist-labs (duration: 00m 58s) [18:14:12] Hauskatze: Interesting. It's live now too. ^ [18:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:34] Niharika: y'know, scap does funny things :) [18:14:47] James_F: Testable? 
It's on mwdebug1002. [18:14:58] still working, good [18:15:01] thanks! [18:15:06] \o/ You're welcome. [18:15:09] One sec. [18:15:47] Niharika: Yup, works great. [18:16:12] Alrighty. [18:16:32] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt DNS entries for dns200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/439973 (https://phabricator.wikimedia.org/T196493) (owner: 10Papaul) [18:17:16] (03PS2) 10Dzahn: DHCP: Add MAC address and netboot entries for lvs2009 and lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/440360 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [18:17:37] !log niharika29@deploy1001 Synchronized wmf-config/CommonSettings.php: Bump ExtensionDistributor default to REL1_31 (duration: 00m 57s) [18:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:43] James_F: Done. [18:17:50] Thanks! [18:19:56] (03CR) 10Dzahn: DHCP: Add MAC address and netboot entries for lvs2009 and lvs2010 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440360 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [18:20:06] (03CR) 10Dzahn: [C: 04-1] DHCP: Add MAC address and netboot entries for lvs2009 and lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/440360 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [18:21:47] (03PS2) 10Dzahn: DNS: Add production DNS entries for bast2002 [dns] - 10https://gerrit.wikimedia.org/r/439965 (https://phabricator.wikimedia.org/T196665) (owner: 10Papaul) [18:23:43] (03PS1) 10Ottomata: Re-enable job and chagne-prop topic mirroring from main-codfw -> main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/440385 (https://phabricator.wikimedia.org/T197254) [18:23:45] (03PS1) 10Ottomata: Re-enable job and chagne-prop topic mirroring from main-eqiad -> main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/440386 (https://phabricator.wikimedia.org/T197254) [18:24:05] (03CR) 10Dzahn: [C: 04-1] "missing the reverse record in 153.80.208.in-addr.arpa" [dns] - 
10https://gerrit.wikimedia.org/r/439965 (https://phabricator.wikimedia.org/T196665) (owner: 10Papaul) [18:24:27] (03PS2) 10Ottomata: Re-enable job and change-prop topic mirroring from main-eqiad -> main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/440386 (https://phabricator.wikimedia.org/T197254) [18:25:03] (03PS2) 10Ottomata: Re-enable job and change-prop topic mirroring from main-codfw -> main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/440385 (https://phabricator.wikimedia.org/T197254) [18:36:43] (03CR) 10Ottomata: statistics::discovery: re-enable cron job (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [18:38:40] (03CR) 10Bearloga: statistics::discovery: re-enable cron job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [18:40:41] (03PS4) 10Bearloga: statistics::discovery: re-enable cron job [puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) [19:00:04] marxarelli: (Dis)respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180614T1900). Please do the needful. [19:02:13] choo choo [19:03:49] (03PS5) 10Ottomata: statistics::discovery: re-enable cron job [puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [19:06:21] * thcipriani trains [19:07:15] (03CR) 10Ottomata: "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [19:11:03] (03PS1) 10Thcipriani: all wikis to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440390 [19:11:05] (03CR) 10Thcipriani: [C: 032] all wikis to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440390 (owner: 10Thcipriani) [19:12:31] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440390 (owner: 10Thcipriani) [19:12:47] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440390 (owner: 10Thcipriani) [19:13:37] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.8 [19:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:47] !log Reenable puppet in cache:misc nodes - T192555 [19:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:52] T192555: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 [19:24:17] so thcipriani the "rebuilt and synchronized wikiversions files" part, doesnt actually deploy the code everywhere? or? [19:24:36] or it does, and https://tools.wmflabs.org/versions/ just falls behind sometimes [19:24:40] addshore: yeah, wmf.8 has been everywhere since that wikiversion sync [19:24:44] aaah [19:24:49] tool is lagging :) [19:25:24] does the same thing, FWIW: https://gist.github.com/thcipriani/4b5a5e592465fe6c3bd1baa4bfa41aa6 [19:25:32] command line script [19:26:07] works basically the same as the tool, useful for checking, well, versions for dblists like it says :) [19:34:02] (03PS12) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [19:45:13] (03CR) 10Ottomata: "Find me on IRC and let's merge this together. You can then verify that the cronjob works as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [19:45:35] (03CR) 10Urbanecm: "@Hashar: Then, feel free to schedule. I still plan to take care about this, but you probably have access to more private wikis than I have" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [19:52:03] (03PS1) 10Valerie: Turn on Page Creation Logging on Test Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440393 (https://phabricator.wikimedia.org/T196400) [19:57:40] (03CR) 10Ottomata: [C: 032] statistics::discovery: re-enable cron job [puppet] - 10https://gerrit.wikimedia.org/r/438125 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [20:04:23] 10Operations, 10Performance-Team, 10Traffic: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#4283942 (10Krinkle) 05Open>03Resolved a:03Krinkle Obsolete per . The cookie has now been removed entirely :) [20:06:24] PROBLEM - nova-compute proc maximum on labvirt1019 is CRITICAL: Return code of 255 is out of bounds [20:06:32] PROBLEM - Check systemd state on labvirt1019 is CRITICAL: Return code of 255 is out of bounds [20:06:39] PROBLEM - Disk space on labvirt1019 is CRITICAL: Return code of 255 is out of bounds [20:07:52] PROBLEM - Check the NTP synchronisation status of timesyncd on labvirt1019 is CRITICAL: Return code of 255 is out of bounds [20:08:06] addshore: ok, there are some new errors from wmf.8 but nothing that I think rises to the level of rollback, so I think we're ready to go with wmf.999 (which is currently staged on deploy1001) [20:08:18] wooooooooo [20:09:09] so, first thing is I'll run a sync to testwiki only. This will generate all the l10n stuff and then move over test wiki at the end. [20:09:23] okay! 
[20:09:37] 10Operations, 10CirrusSearch, 10Discovery, 10Elasticsearch, and 2 others: Alert when elasticsearch writes are frozen for too long - https://phabricator.wikimedia.org/T193605#4283959 (10debt) 05Open>03Resolved [20:10:23] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/discovery/golden] [20:11:46] !log re-enable and run puppet on text@codfw - T192555 [20:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:51] T192555: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 [20:12:27] !log thcipriani@deploy1001 Started scap: testwiki to 1.32.0-wmf.999 (multi-content-revisions T196585) and rebuild l10n cache [20:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:31] T196585: Deploy some MCR related patches on test / group0 for an extended period - https://phabricator.wikimedia.org/T196585 [20:17:49] so I got paged for labvirt1019 [20:17:55] which I think is a new host [20:18:11] well scratch that, I'm sure it is a new host :) [20:18:28] I guess being installed by someone now? 
[20:20:33] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:21:19] !log re-enable and run puppet on text@ulsfo - T192555 [20:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:23] T192555: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 [20:26:24] thcipriani: i hate l10n rebuild [20:26:54] it is time consuming [20:35:40] (03PS1) 10Andrew Bogott: labtest ldap: use labtestcontrol2003 as the keystone host [puppet] - 10https://gerrit.wikimedia.org/r/440419 [20:37:10] (03CR) 10Andrew Bogott: [C: 032] labtest ldap: use labtestcontrol2003 as the keystone host [puppet] - 10https://gerrit.wikimedia.org/r/440419 (owner: 10Andrew Bogott) [20:41:07] 28 mins down... [20:42:18] (03PS1) 10Smalyshev: Add Lexemes to instant-index set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440420 (https://phabricator.wikimedia.org/T196896) [20:45:07] addshore: still Updating LocalisationCache for 1.32.0-wmf.999 [20:45:14] !log re-enable and run puppet on rest of cache_text (eqiad, eqsin, esams) - T192555 [20:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:18] T192555: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 [20:45:27] \o/ [20:45:35] 32 mins down, 8 ish to go? :P [20:46:17] addshore: for reference this was Tuesday's https://tools.wmflabs.org/sal/log/AWP1TqtLwY2u4JUTe_FT (also sync-masters just started) [20:47:14] woo! [20:48:53] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [20:50:30] (03CR) 10Bstorm: [C: 031] "I love this idea. Any detractors?" 
[puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [20:52:12] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:55:03] 10Operations: Update people.wikimedia.org with the 2018 Wikimedia hackathon group photo - https://phabricator.wikimedia.org/T197268#4284103 (10Framawiki) p:05Triage>03Low [20:56:32] addshore: proxies [20:56:36] woo [20:58:32] PROBLEM - Disk space on mwdebug2001 is CRITICAL: DISK CRITICAL - free space: / 323 MB (0% inode=69%) [20:58:40] ooof [20:59:38] log files I would guess... [21:00:03] perhaps with the extra branch [21:00:17] yeah, we probably also need a cleanup [21:00:52] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 64.61, 23.77, 15.35 [21:01:35] eww [21:01:53] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 30.46, 21.55, 15.14 [21:02:06] (03CR) 10BryanDavis: [C: 031] "I had some complaints on the linked task, but I am ok with this moving forward for grid engine usage." [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [21:03:23] PROBLEM - Disk space on mwdebug2002 is CRITICAL: DISK CRITICAL - free space: / 74 MB (0% inode=69%) [21:04:38] ouch, mwdebug2002 is even tighter [21:06:34] /dev/vda1 41G 39G 0 100% / [21:07:51] mwdebug1001 has a 52 gig disk [21:10:20] (03PS1) 10Framawiki: Update group photo on people.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/440467 (https://phabricator.wikimedia.org/T197268) [21:12:40] !log re-enable and run puppet on cache_upload - T192555 [21:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:45] T192555: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 [21:18:12] thcipriani: is scap still running? 
[21:18:17] !log thcipriani@deploy1001 Finished scap: testwiki to 1.32.0-wmf.999 (multi-content-revisions T196585) and rebuild l10n cache (duration: 65m 50s) [21:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:21] T196585: Deploy some MCR related patches on test / group0 for an extended period - https://phabricator.wikimedia.org/T196585 [21:18:22] woo! [21:18:23] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [21:18:24] MaxSem: just finished [21:18:39] lololol [21:18:45] 65 mins, lovely [21:19:01] addshore: testwiki will be very slow due to hhvm cache rebuilding, but it should have your new branch [21:19:05] * thcipriani cleans old branches [21:19:08] *looks* [21:21:37] (03PS1) 10Aaron Schulz: Make mediawiki.org write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440469 [21:21:39] (03PS1) 10Aaron Schulz: Make all non-test wikis write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440470 [21:21:41] (03PS1) 10Aaron Schulz: Enable prefix routing wildcards for mcrouter purge broadcasting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440471 [21:21:43] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:21:43] (03PS1) 10Aaron Schulz: Use "memcached-mcrouter" as the main cache type for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440472 [21:21:45] (03PS1) 10Aaron Schulz: Only send cache writes to mcrouter for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440473 [21:22:09] it is indeed on .999 and also a bit slow! [21:22:44] thcipriani: is there a plan for the rest of group0? or stick to test wiki for tonight? [21:23:30] addshore: was planning for the rest of group0. 
First cleaning old versions, syncing backports, then group0 (group0 only takes a minute) [21:23:36] ack [21:23:45] * addshore will hang around for a little bit more [21:25:30] it's probably been a super long day with this rollout, eh? :\ [21:25:52] i just rolled over to 14 hours :D [21:26:03] brutal :( [21:26:11] had other stuff to do too, gonna have a smaller day tomorrow [21:26:16] also going to greece next week. soooo. [21:26:26] addshore: <3 [21:26:30] greece is nice [21:27:17] I hear it is good for https://www.wikidata.org/wiki/Lexeme:L4 [21:27:22] !log thcipriani@deploy1001 Pruned MediaWiki: 1.32.0-wmf.1 (duration: 06m 54s) [21:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:10] (03PS1) 10Thcipriani: Group0 to 1.32.0-wmf.999 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440476 (https://phabricator.wikimedia.org/T196585) [21:37:31] MaxSem: your pagetriage change is live on mwdebug1002 for wmf.8 could you check please if you're around? 
[21:40:22] (03CR) 10Thcipriani: [C: 032] Group0 to 1.32.0-wmf.999 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440476 (https://phabricator.wikimedia.org/T196585) (owner: 10Thcipriani) [21:40:24] in the interim I'll get group0 up-to-date [21:40:41] thcipriani: I'm not sure how to repro, it's can't get any worse if you just push it;) [21:40:47] :) [21:40:52] MaxSem: ok, will go live [21:41:58] (03Merged) 10jenkins-bot: Group0 to 1.32.0-wmf.999 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440476 (https://phabricator.wikimedia.org/T196585) (owner: 10Thcipriani) [21:42:05] Woo [21:42:14] (03CR) 10jenkins-bot: Group0 to 1.32.0-wmf.999 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440476 (https://phabricator.wikimedia.org/T196585) (owner: 10Thcipriani) [21:42:14] MaxSem: I'm guessing it's when you have pagetriage notifications [21:42:18] (03PS1) 10Andrew Bogott: nova-api: allow access to port 8774 for api access [puppet] - 10https://gerrit.wikimedia.org/r/440478 [21:42:26] But yes it can't get more worse than it is [21:43:26] *any worse [21:44:05] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.8/extensions/PageTriage/includes/Hooks.php: [[gerrit:440474|Fix event presentation class names]] T197262 (duration: 00m 52s) [21:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:09] T197262: PHP Fatal Error: Class undefined: PageTriageMarkAsReviewedPresentationModel - https://phabricator.wikimedia.org/T197262 [21:44:17] ^ MaxSem musikanimal live now [21:44:24] (03PS4) 10Andrew Bogott: toollabs: install python{,3}-pymysql on exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [21:44:28] Thanks! 
[21:45:09] hrm, weird error on merging this cherry-pick https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/440475 [21:45:50] (03CR) 10Andrew Bogott: [C: 032] toollabs: install python{,3}-pymysql on exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [21:45:57] selenium is falky [21:46:23] forced [21:46:43] thanks [21:46:53] addshore: ok, here's the big sync [21:47:52] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.999 T196585 [21:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:56] T196585: Deploy some MCR related patches on test / group0 for an extended period - https://phabricator.wikimedia.org/T196585 [21:48:11] ^ addshore live on group0 [21:48:35] Wooooo [21:54:33] (03PS3) 10Papaul: DHCP: Add MAC address and netboot entries for lvs2009 and lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/440360 (https://phabricator.wikimedia.org/T196560) [21:56:56] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.999/extensions/PageTriage/includes/Hooks.php: [[gerrit:440475|Fix event presentation class names]] T197262 (duration: 00m 57s) [21:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:00] T197262: PHP Fatal Error: Class undefined: PageTriageMarkAsReviewedPresentationModel - https://phabricator.wikimedia.org/T197262 [22:00:04] kaldari: Time to snap out of that daydream and deploy Page Creation Logging deployment. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180614T2200). 
[22:02:34] * addshore taps out [22:03:42] (03PS3) 10Papaul: DNS: Add reverse production DNS entries for bast2002 [dns] - 10https://gerrit.wikimedia.org/r/439965 (https://phabricator.wikimedia.org/T196665) [22:17:15] (03PS7) 10Krinkle: mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 [22:17:31] (03CR) 10jerkins-bot: [V: 04-1] mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 (owner: 10Krinkle) [22:17:59] (03PS8) 10Krinkle: mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 [22:18:00] (03CR) 10jerkins-bot: [V: 04-1] mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 (owner: 10Krinkle) [22:18:32] (03PS9) 10Krinkle: mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 [22:18:34] (03PS10) 10Krinkle: mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 [22:18:50] (03PS11) 10Krinkle: mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 [22:19:22] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [22:21:52] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received [22:22:33] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[22:22:53] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [22:25:40] (03PS24) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [22:26:36] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [22:36:49] (03PS1) 10Papaul: DHCP: Change backup2001 MAC address from 1G MAC to 10G MAC [puppet] - 10https://gerrit.wikimedia.org/r/440485 (https://phabricator.wikimedia.org/T196477) [22:38:56] (03PS25) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [22:39:50] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [22:44:36] (03PS26) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [22:45:27] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180614T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:44] I'm the only one with a patch so I'll deploy it myself [23:02:00] RoanKattouw: Unless you want me to? 
[23:03:19] (03CR) 10Valerie: [C: 032] Turn on Page Creation Logging on Test Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440393 (https://phabricator.wikimedia.org/T196400) (owner: 10Valerie) [23:04:31] RoanKattouw: I'm also deploying a config change. Was supposed to be in the previous hour window, but got delayed. [23:04:43] (03Merged) 10jenkins-bot: Turn on Page Creation Logging on Test Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440393 (https://phabricator.wikimedia.org/T196400) (owner: 10Valerie) [23:05:29] but I can manage it myself [23:06:10] kaldari: I can do that if you want. [23:06:21] kaldari: James_F is deploying, let's have him do it for his first SWAT deployment [23:06:27] sure [23:06:32] it's already merged [23:06:46] just a change to InitialiseSettings.php [23:06:57] OK yeah I see it here in the channel [23:07:20] yay! [23:07:20] I had to do it under a temp gerrit account since my gerrit account is still hosed :( [23:07:34] sorry ^ :( (we're still working on it) [23:07:44] no worries. I know it's hairy [23:07:46] stupid upgrade fallout [23:07:49] upgrades are bad, mmm'kay [23:08:06] (03CR) 10jenkins-bot: Turn on Page Creation Logging on Test Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440393 (https://phabricator.wikimedia.org/T196400) (owner: 10Valerie) [23:08:30] kaldari: Please test, it's on mwdebug1002. [23:08:36] looking... [23:10:37] James_F: Looks good, feel free to sync [23:10:58] kaldari: Kk. [23:11:52] James_F: <3 [23:11:58] thanks for scheduling that for me [23:12:10] legoktm: Always a pleasure. [23:12:38] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T196400 for SWAT (duration: 01m 00s) [23:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:44] T196400: Deploy new page creation log - https://phabricator.wikimedia.org/T196400 [23:12:59] * Reedy eyes James_F [23:13:13] !ADMIN [23:13:13] James_F is deploying? 
That takes away half the patches for SWAT deploys. I'll be jobless. :P [23:13:38] Niharika: Yeah yeah. [23:13:49] Unfortunately I got a scap error. [23:13:59] mw2216.codfw.wmnet is out of space? [23:14:10] yeah, it was bitching earlier [23:14:11] James_F: works great. Thanks! [23:14:12] tiny / [23:14:19] Reedy: Is there a task? Should I make one. [23:14:22] kaldari: Happy to help. [23:14:33] Looks like thcipriani and addshore didn't file one [23:14:40] Kk, making one now [23:14:58] There was some of the mwdebug hosts in codfw erroring too IIRC [23:15:04] 2216 is a new one :( [23:15:19] it was mwdebug2001 and mwdebug2002 [23:15:50] huh, it was recently reimaged: https://tools.wmflabs.org/sal/production?p=0&q=mw2216&d= [23:15:56] Sorry, no, my mistake. [23:16:04] It was from 2216, to mwdebug2001.codfw.wmnet. [23:16:25] oh so 2216 was just the proxy? [23:16:28] Yeah. [23:17:13] I pruned an old wikiversion, that bought 3GB or so but all the others are < 4 weeks old [23:17:15] Filed as T197275, dumped on RelEng 'cos I don't know how to tag it. [23:17:16] T197275: Scap error from mwdebug2001.codfw.wmnet: sync: write failed on "/srv/mediawiki/wmf-config/InitialiseSettings.php": No space left on device (28) - https://phabricator.wikimedia.org/T197275 [23:17:33] thcipriani: I guess having wmf.999 doesn't help. :-) [23:17:43] 10Operations, 10Release-Engineering-Team: Scap error from mwdebug2001.codfw.wmnet: sync: write failed on "/srv/mediawiki/wmf-config/InitialiseSettings.php": No space left on device (28) - https://phabricator.wikimedia.org/T197275#4284395 (10Krenair) [23:18:11] evidently not :) [23:18:18] How many old versions are still staged? [23:18:51] wmf.2-wmf.8 [23:19:00] wmf.2 is May 21 [23:19:12] presumably the older just non PHP assets? 
[23:19:27] mwdebug2001 has a tiny amount of space allocated [23:19:53] I need to check to make sure that the older versions are all partially pruned [23:20:50] doesn't look like it, I can try to clean up some more post-swat, but, each version in total is like 3GB [23:21:01] so it's not going to buy a ton, likely [23:22:51] I think it was about 40GB drive [23:22:54] So it's nearly half MW [23:24:29] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.8/resources/src/mediawiki.rcfilters/styles/mw.rcfilters.less: T195903 SWAT for Roan (duration: 00m 59s) [23:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:33] ah, we should be able to drop more versions, I just looked at the date on disk without thinking about it too deeply [23:24:33] T195903: Reduce vertical space between Watchlist filtering tools and first line of results - https://phabricator.wikimedia.org/T195903 [23:24:49] should be able to remove up-to wmf.4 https://www.mediawiki.org/wiki/MediaWiki_1.32/Roadmap [23:25:08] OK, SWAT complete. [23:25:14] * thcipriani does some cleaning [23:25:51] James_F: congrats on the first SWAT! [23:25:57] thcipriani: Thanks. [23:27:52] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received [23:28:52] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [23:31:22] !log thcipriani@deploy1001 Pruned MediaWiki: 1.32.0-wmf.2 (duration: 04m 50s) [23:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:23] hmm, are we still swatting? [23:32:41] it seems that the patch for https://phabricator.wikimedia.org/T196914 could use backporting [23:32:59] James_F: ^ [23:32:59] Eurgh. [23:33:00] D: [23:33:07] MatmaRex: Make me a patch. [23:34:43] hmm, actually. it might already be deployed [23:35:06] wmf/1.32.0-wmf.8 [23:35:07] Yeah. [23:35:28] yeah. 
yay! crisis averted [23:40:44] !log thcipriani@deploy1001 Pruned MediaWiki: 1.32.0-wmf.3 (duration: 04m 38s) [23:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:32] RECOVERY - Disk space on mwdebug2002 is OK: DISK OK [23:44:03] RECOVERY - Disk space on mwdebug2001 is OK: DISK OK [23:48:24] 10Operations, 10Release-Engineering-Team: Scap error from mwdebug2001.codfw.wmnet: sync: write failed on "/srv/mediawiki/wmf-config/InitialiseSettings.php": No space left on device (28) - https://phabricator.wikimedia.org/T197275#4284384 (10thcipriani) I pruned 1.32.0-wmf.{1,2,3} to clear out some space, but s... [23:50:35] 10Operations, 10Release-Engineering-Team: Scap error from mwdebug2001.codfw.wmnet: sync: write failed on "/srv/mediawiki/wmf-config/InitialiseSettings.php": No space left on device (28) - https://phabricator.wikimedia.org/T197275#4284384 (10Reedy) /dev/vda1 on mwdebug2002 is 39GB. mwdebug1001 has 49GB